The current gene mapping for complex diseases is heavily weighted by studies of population samples from northern Europe. To capture the full range of genetic diversity and exploit the potential of genetic epidemiology to identify important variants, multiple additional populations will need to be examined. The conduct of genome-wide association studies will therefore confront many of the challenges identified in the first generation of candidate gene and linkage studies, with a substantial increase in complexity. Initial efforts to map causal effects will have to take account of varying patterns of linkage disequilibrium through careful attention to local haplotype structure. Refined statistical techniques that permit joint analyses of samples from multiple populations will also be required, as well as improved methods to account for on-going gene flow between populations with geographically distinct ancestral origins. This variation can either be an impediment, slowing the process of replication, or an opportunity, allowing finer dissection of the relevant variants. Clinical translation of these data will present major challenges. Large cosmopolitan populations, such as those found in large urban centers, are likely to exhibit both known and cryptic sub-structure across groups, as well as admixture within individuals. Great care will need to be devoted to generalizability of association findings to avoid their premature adoption as predictive tests in the face of this widespread heterogeneity.
A broad understanding of the implications of inter-individual genetic variation and risk of common illness will require the inclusion of populations with a range of geographic origin and disease burden. Given the current limited state of knowledge regarding the genetic architecture that underlies common traits, it is not yet possible to make straightforward inferences about the generalizability of markers across populations. However, assuming that a mixture of common and rare/uncommon variants will be involved, and further assuming that many genes will confer susceptibility, the overall heterogeneity will be substantial. Furthermore, as recognized in the extensive literature on linkage disequilibrium (LD) mapping for candidate genes, heterogeneity across populations will have substantial implications for initial efforts to localize variants. In addition to the existence of population structure across geographic ancestry, modern cosmopolitan societies give rise to extensive gene flow resulting in individual-level admixture. Finally, from a public health perspective, the disease burden experienced by sub-populations defined by ethnicity, geographic origin and so on often varies both within and between countries. Genetic studies will therefore have more relevance for groups with high disease burden.
Most of the issues that confront the application of genome-wide association (GWA) studies have emerged in prior research, and most have already been addressed with varying degrees of success. The most notable change, of course, is the sheer magnitude of the challenge given the rapid growth in verified risk loci and the potential number of sub-populations. A particular concern is the need to use the collective scientific resources in a cost-effective manner. A useful approach to this new phase might involve a strategy to balance the mere existence of an opportunity with an assessment of the potential value of the new knowledge that can be obtained. Without some system of prioritization, the risk exists that studies will proliferate without direction or careful attention to the utility of the additional information. This challenge will certainly apply when decisions are made regarding the need for studies in populations beyond those where the original discovery was made.
We review below some of the key issues that will be encountered as GWA studies are put to use across human sub-populations.
With the advent of high-throughput chip genotyping technology, population structure can be detected with very high resolution (1). Results from the Human Genome Diversity Panel (HGDP) suggest that one dimension of genetic structure in human populations falls along geographic/continental lines. Similar evidence was derived from a study of 1056 individuals from 52 populations using 377 microsatellite markers (2) as well as a follow-up examination of the same individuals using 650 000 common single nucleotide polymorphisms (SNPs) (1). Although the HGDP is not a random sample of the world’s populations, the results are consistent with the hypothesis of a serial founder effect with a single origin in sub-Saharan Africa. The HGDP tended to sample from smaller, more isolated groups, and it is still too early to know how well the description arising from the HGDP data set correlates with what would be found in larger out-bred population groups. Nonetheless, it is clear that geographic differentiation will have a significant impact on disease association studies. For populations which have been relatively isolated, self-identified ancestry may therefore be an effective proxy for population structure, and thus useful in association studies (3). The meaning of this potential heterogeneity could likewise be of interest. Neutral variants will diverge due to random genetic drift, whereas functional variants may be under selective pressure due to adaptation to the new environments encountered during migration. Distinguishing these two sources of variation may be helpful in identifying true disease variants in association studies.
In addition to its effect on the frequency of true causal mutations, geographic differentiation can create analytic challenges when data from two or more populations are combined. The potential for confounding that can arise has been recognized for many years, and a variety of statistical methods have been developed to control for false associations in this setting (4–8). The extension of GWA studies into more populations will likely mean that the significance of structure will now have to be investigated at a much higher level of resolution. The issue of population structure will become even more complex as the sample size increases (9). While it has generally been assumed that the HapMap provides a reasonable guide to coverage with tagging SNPs in the three regional populations studied in detail to date, a recent analysis based on extensive re-sequencing suggests this assumption may not be correct (10). Using near-complete data from 76 genes as a reference standard, it was found that even with the dense coverage provided by a gene chip, such as the Affymetrix 6.0, only about 45% of SNPs were tagged (r² > 0.8) in the YRI sample (10). These findings demonstrate that the current commercial chips may miss important variants in association studies in African-origin populations. In fact, very few chronic disease studies have been carried out among African populations, and Yorubans have served as the prototype in most intensive surveys of African genomes (11–13). An on-going study sponsored by the NHGRI will add three additional populations from East Africa and offer important new insights into what must be a complex pattern of population structure within Africa. In fact, a regional survey of Latin American populations has demonstrated an almost bewildering array of possible ancestral groupings, in a region where there is thought to be less geographic differentiation than in Africa as a result of a shorter history of human occupation (14).
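The tagging criterion used above (r² > 0.8) can be made concrete with a small calculation. The following Python sketch is illustrative only: it assumes phased haplotypes coded 0/1, and the function names `ld_r2` and `fraction_tagged` as well as the toy data are hypothetical, not taken from the cited re-sequencing analysis. It computes r² as D² / [p_A(1 − p_A) p_B(1 − p_B)] and reports the fraction of reference SNPs that have at least one chip SNP exceeding the threshold.

```python
from typing import Sequence


def ld_r2(hap_a: Sequence[int], hap_b: Sequence[int]) -> float:
    """Squared correlation (r^2) between two biallelic SNPs.

    Inputs are phased haplotypes coded 0/1, one entry per chromosome.
    """
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("haplotype vectors must be non-empty and equal in length")
    p_a = sum(hap_a) / n                                   # allele frequency at SNP A
    p_b = sum(hap_b) / n                                   # allele frequency at SNP B
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n    # frequency of the 1-1 haplotype
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # monomorphic SNP: r^2 is undefined, treated here as untagged
    d = p_ab - p_a * p_b                                   # disequilibrium coefficient D
    return d * d / denom


def fraction_tagged(targets, chip_snps, threshold=0.8):
    """Fraction of target SNPs with r^2 above the threshold to any chip SNP."""
    tagged = sum(
        1 for t in targets
        if any(ld_r2(t, c) > threshold for c in chip_snps)
    )
    return tagged / len(targets)


# Toy data (hypothetical): one chip SNP and two re-sequenced reference SNPs.
chip = [[0, 0, 0, 0, 1, 1, 1, 1]]
reference = [
    [0, 0, 0, 0, 1, 1, 1, 1],  # perfectly correlated with the chip SNP
    [0, 1, 0, 1, 0, 1, 0, 1],  # uncorrelated with the chip SNP
]
coverage = fraction_tagged(reference, chip)
print(f"fraction of reference SNPs tagged: {coverage:.2f}")
```

In real data the haplotypes would come from a population-specific reference panel, which is precisely why coverage estimated in one population need not carry over to another.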
Quantitative estimates of variation in haplotype frequency and structure have recently been obtained with the data in the HGDP (15, 16) (Fig. 1). With similar reference data, it should be possible to test the consistency of the haplotype structure at the location of tagging SNPs that have been shown to be associated with disease risk. These data might provide insight into the generalizability of the associations that are based on proxy markers, rather than causal mutations. Given the sampling frame for this study, however, as noted, it is not clear how well these estimates apply to large national populations; for example, many of the samples from China were from minority groups (15). Analyses which attempt to resolve issues related to finer-level structure will also have to devise new sets of ancestry informative markers (AIMs). On the other hand, given current technology, it may actually be more reasonable to use all the marker data available in GWA studies to infer ancestry information rather than AIMs, since the selection of AIMs is often biased to the known population structure.
The development of methods based on principal components analysis (PCA) will be a substantial help in confronting the problem of more subtle structure. For example, even within the same ethnic population, such as European Americans, detectable population stratification still exists (17–19). Principal components analysis methods may also make it possible to pool participants from different geographic/ancestral origins into a single analytic sample. When dense markers are available, such as those from GWA studies, we have recently demonstrated that a PCA of a marker matrix can eliminate spurious association due to stratification and allow pooling of individual-level data from heterogeneous populations (4). Principal components analysis can use either AIMs or random markers without consideration of LD among the markers. Further, the PCA approach is more powerful than both the genomic-control method (7) and meta-analysis that combines P-values from individual studies, and is less computationally intensive than the Structure association method (6). Combining data from geographically distant populations can definitely enhance fine mapping analyses. Experience with common variants in the structural gene associated with circulating levels of angiotensin-converting enzyme (ACE) demonstrated how pooling can increase power dramatically (4). Populations with a simpler haplotype structure, and thereby greater coverage with genome-wide tagging SNPs, can be useful to identify broad regions of interest. Conversely, populations with shorter and more numerous haplotypes can help localize influential variants. It must also be recognized, however, that pooling ethnically different samples can, under some circumstances, introduce additional noise, since multiple population-specific variants (some of which may be rare) may affect the trait variation.
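The logic of PCA-based adjustment can be sketched in a few lines of Python with NumPy. This is a generic illustration under stated assumptions, not the specific procedure of ref. 4: the function names `top_pcs` and `snp_effect`, the simulated allele frequencies and the trait model are all hypothetical. Two sub-populations differ in allele frequency at every marker and in mean trait value, so a non-causal SNP shows a spurious association in the pooled sample; including the leading principal components as covariates shrinks the estimated effect toward zero.

```python
import numpy as np


def top_pcs(genotypes: np.ndarray, k: int = 2) -> np.ndarray:
    """Return scores on the top-k principal components (individuals x k)."""
    g = genotypes.astype(float)
    g = g - g.mean(axis=0)                 # center each marker
    sd = g.std(axis=0)
    g = g[:, sd > 0] / sd[sd > 0]          # drop monomorphic markers, then scale
    u, s, _ = np.linalg.svd(g, full_matrices=False)
    return u[:, :k] * s[:k]


def snp_effect(trait: np.ndarray, snp: np.ndarray, covariates=None) -> float:
    """Least-squares slope of trait on SNP dosage, optionally adjusted."""
    cols = [np.ones(len(trait)), snp.astype(float)]
    if covariates is not None:
        cols.extend(covariates.T)          # one column per covariate
    design = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(design, trait, rcond=None)
    return float(beta[1])


# Simulated data (all values are assumptions made for illustration).
rng = np.random.default_rng(0)
n_per_pop, n_snps = 200, 500
pop = np.repeat([0, 1], n_per_pop)                    # hidden sub-population label
f0 = rng.uniform(0.1, 0.5, n_snps)                    # allele frequencies, population 0
f1 = f0 + 0.3                                         # shifted frequencies, population 1
freq = np.where(pop[:, None] == 0, f0, f1)            # individuals x markers
genotypes = rng.binomial(2, freq)                     # dosages coded 0/1/2
trait = 2.0 * pop + rng.normal(0.0, 1.0, pop.size)    # no SNP is causal

pcs = top_pcs(genotypes, k=2)
naive = snp_effect(trait, genotypes[:, 0])
adjusted = snp_effect(trait, genotypes[:, 0], covariates=pcs)
print(f"unadjusted slope: {naive:.3f}, PC-adjusted slope: {adjusted:.3f}")
```

The sketch uses all markers rather than a curated AIM panel and ignores LD among markers, in line with the point made above that PCA does not require either.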
POPULATION HETEROGENEITY AND GENE MAPPING: AN AID OR AN OBSTACLE
As suggested above, heterogeneity in local genomic structure can limit the generalizability of tagging SNPs and make it difficult to replicate and extend disease association findings. At the same time, varying LD structure across study samples can assist in fine mapping and localization of putative causal variants. The renin–angiotensin system provides a useful test system for mapping causal variants and has now been exhaustively studied. Circulating levels of ACE are highly heritable and the effects map strongly to the structural gene on chromosome 17. At the ACE locus ∼100 variants have been identified, however, and isolation of the causal variant through observations in humans has been difficult. Among European populations there are essentially three extensive haplotypes at this locus, making it impossible to select the functional mutations. Within African-origin populations, however, a much larger number of haplotypes are found, and the extent of LD is shorter. Using this opportunity, we were able to identify three separate regions within this gene that independently influence ACE levels (20). Unfortunately, in this instance, it was not possible to unambiguously define the causal mutations. In other instances, where there may be less genetic heterogeneity, it might be possible to use cross-ethnic comparisons to isolate either a smaller region harboring mutations or the functional mutation itself. Our preliminary experience with well-validated obesity and height loci suggests that this strategy may in fact be useful.
The search for rare variants will almost certainly be influenced by the choice of the study population. In general, of course, since rare variants tend to be of more recent origin they are less likely to be cosmopolitan. Founder effects will of course influence the prevalence of specific variants, as observed for many Mendelian traits. The presence of more rare or uncommon mutations in African-origin populations will also mean that influential rare variants are likely to be over-represented in these groups. A series of rare variants that influence lipids, for example, has been identified in African Americans and would not have been found in Europeans (21, 22). Thus, in some instances, genetic heterogeneity can have specific analytic value, while deepening our appreciation of the full range of complexity.
Modern populations formed by the recent admixture of geographically divergent ancestral populations, such as African Americans and Hispanics, can be extremely useful in admixture mapping. The idea of admixture mapping is similar to that of traditional linkage analysis: admixture mapping traces the ancestral chromosomes shared by affected individuals, whereas linkage analysis traces the chromosomes shared in related pairs. By comparing the percentage of ancestral chromosomes in a specific genomic location with the average ancestry across the genome in affected individuals, it is possible to detect a region that may harbor causal variants. In addition, excess ancestry at a genomic location (defined as the difference between a locus-specific ancestry and the average ancestry across the genome) can be compared between cases and controls. The feasibility of this strategy has been demonstrated in various theoretical studies, including classical likelihood-based methods (23) and Bayesian approaches (24). These theoretical studies indicate that admixture mapping is potentially more powerful than traditional linkage studies when the population risk ratio between parental populations is high at a given locus; at the same time, the cost of genotyping for an admixture mapping study is generally lower than that of GWA studies. Because of the availability of AIMs, an initial set of studies in admixed populations has found correlations between individual ancestry (defined as the genome-wide average proportion of ancestry from a given population) and various phenotypes, including skin pigmentation (25), type 2 diabetes (26), high blood pressure (27) and prostate cancer (28), although we acknowledge that most still require confirmation.
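The excess-ancestry statistic defined above lends itself to a short worked example. The Python sketch below is a simplified illustration, not the likelihood-based or Bayesian methods of refs 23 and 24: it assumes that locus-specific ancestry has already been inferred for each individual (coded as the proportion of the two chromosomes derived from one parental population: 0, 0.5 or 1), and the function name `excess_ancestry` and the toy values are hypothetical. It computes the per-locus mean excess ancestry in cases and in controls and takes their difference, with no significance test attached.

```python
from statistics import mean


def excess_ancestry(local):
    """Per-locus mean of (locus-specific ancestry - genome-wide ancestry).

    `local` is a list of individuals, each a list of locus-specific ancestry
    proportions (0, 0.5 or 1) from one parental population.
    """
    n_loci = len(local[0])
    genome_wide = [mean(row) for row in local]   # individual ancestry
    return [
        mean(row[j] - gw for row, gw in zip(local, genome_wide))
        for j in range(n_loci)
    ]


# Toy data (hypothetical): three cases and three controls at four loci.
cases = [
    [1.0, 0.5, 1.0, 0.5],
    [0.5, 0.5, 1.0, 0.0],
    [1.0, 0.0, 1.0, 1.0],
]
controls = [
    [0.5, 1.0, 0.5, 1.0],
    [1.0, 0.5, 0.5, 0.0],
    [0.5, 1.0, 0.5, 1.0],
]

case_excess = excess_ancestry(cases)
control_excess = excess_ancestry(controls)
difference = [c - k for c, k in zip(case_excess, control_excess)]
peak_locus = max(range(len(difference)), key=lambda j: difference[j])
print(f"largest case-control difference at locus index {peak_locus}")
```

In practice the locus-specific ancestries are themselves estimated with uncertainty, and the difference at each locus would be evaluated against its sampling distribution rather than simply ranked.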
The resolution of admixture mapping is generally higher than for linkage analysis but lower than for association. The region identified by admixture mapping can vary in length from several centiMorgans to >20 cM. With dense markers genotyped in GWA studies, SNPs in regions with signals from admixture mapping analysis can subsequently be tested for association. By limiting the search only to regions with the initial admixture mapping evidence, the number of statistical tests is much smaller than in GWA studies, thereby potentially increasing the power to detect association (in particular, reducing Type 2 error). In addition, an association analysis based on testing a marker panel cannot differentiate the causal SNP from proxies. However, testing whether an SNP explains the evidence identified by admixture mapping can provide better evidence on the potential causal role. The idea of searching for potential causal variants has been well studied in linkage analysis (29, 30). Sun et al. (29) developed a statistical approach to test whether an SNP can fully explain the observed linkage evidence. Li et al. (30) further extended this to a method that can quantify the degree of LD between a candidate SNP and a putative disease locus through joint modeling of linkage and association. Theoretically, admixture mapping can be viewed as a linkage analysis with the entire admixed population as one family. Thus, approaches similar to those proposed by Sun et al. (29) and Li et al. (30) can be applied in admixture mapping to search for the potential causal SNPs that can explain the admixture signal. Currently, many GWA studies are underway in African-American populations. For example, the Candidate-Gene Association Resource (CARe) project sponsored by NHLBI will conduct GWA with many traits using the Affymetrix 6.0 platform, including ∼9500 African-American participants ( http://public.nhlbi.nih.gov/GeneticsGenomics/home/care.aspx ). This data set presents a tremendous opportunity for admixture mapping. Because of the availability of the dense marker set, it will be possible to search for potential causal SNPs that could be missed by GWA studies due to the high degree of multiple testing.
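The gain from restricting association testing to regions flagged by admixture mapping can be illustrated with simple Bonferroni arithmetic. In the Python sketch below every number is an assumption chosen for illustration (one million SNPs genome-wide, 5000 SNPs within the admixture-mapping regions, and a non-centrality parameter of 4.5 for the test statistic at a true variant), and the helper names are hypothetical. Power is approximated for a two-sided z-test.

```python
from statistics import NormalDist


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level that keeps the family-wise error at alpha."""
    return alpha / n_tests


def power_two_sided(ncp: float, threshold: float) -> float:
    """Power of a two-sided z-test with non-centrality parameter `ncp`."""
    z = NormalDist()
    z_crit = -z.inv_cdf(threshold / 2)      # critical value for the threshold
    return (1 - z.cdf(z_crit - ncp)) + z.cdf(-z_crit - ncp)


ALPHA = 0.05
N_GENOME_WIDE = 1_000_000   # assumed number of SNPs tested genome-wide
N_REGION = 5_000            # assumed number of SNPs in admixture-mapping regions
NCP = 4.5                   # assumed effect size on the z scale

genome_threshold = bonferroni_threshold(ALPHA, N_GENOME_WIDE)
region_threshold = bonferroni_threshold(ALPHA, N_REGION)
power_genome = power_two_sided(NCP, genome_threshold)
power_region = power_two_sided(NCP, region_threshold)

print(f"genome-wide threshold {genome_threshold:.1e}, power {power_genome:.2f}")
print(f"regional threshold    {region_threshold:.1e}, power {power_region:.2f}")
```

The comparison ignores the error rate of the admixture-mapping stage itself, which would need to be accounted for in a formal two-stage design.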
CLINICAL TRANSLATION OF GWA FINDINGS AND THE DILEMMA OF BIOBANKS
Repositories of biological samples are being developed in a wide variety of contexts and they offer a glimpse of the complex problems that await us in the phase of clinical translation (31–36). Many of these ‘biobanks’ will rely on clinical operations as a source of samples. In large metropolitan areas, like New York City and Los Angeles, the racial/ethnic background of patients in these systems is highly varied. A large medical facility in Manhattan recently implemented a procedure for assigning racial/ethnic identity into eight discrete categories, with an option for ‘other’ (E. Bottinger, personal communication). As is well recognized, some of these categories are particularly complex, such as ‘Hispanic’, where persons from Caribbean islands will be combined with immigrants from Central and South America. Considerable stratification may persist within these groups. Attempts to replicate the published findings for common disease in these contexts will require careful attention to this heterogeneity. Even more challenging will be the attempt to use the published markers as individual predictive tests in these ‘out-bred’ populations. As demonstrated in the recent HGDP analysis, the distribution of haplotypes tagged by various markers will be extremely complex. When effect sizes are modest even in the discovery populations, considerable uncertainty will apply in clinical translation. This concern, of course, already applies for genetic testing for monogenic disorders, such as cystic fibrosis (37). The NHGRI recently funded a multi-center project that will test well-replicated GWA findings in large multi-ethnic epidemiologic samples (38, 39), and these data should prove to be a useful guide to generalizability across populations.
With the accumulating knowledge being generated by GWA studies and the advances in genotyping technology, it is expected that more ethnic groups will be represented in mapping studies, greatly expanding our knowledge of the genetics of common human complex disorders. A variety of challenges and opportunities can be identified. Heterogeneity is a well-recognized reason for failure to replicate genetic association findings. Even within the same ethnic group, subgroups with cryptic genetic heterogeneity may not be recognized. With the appropriate analytical tools, it is possible to identify heterogeneity and even cryptic relationships among unrelated individuals with great precision, and the genotype–phenotype analyses can then be restricted to the subset of homogeneous samples. At the same time, admixture mapping is an important opportunity that arises from the study of heterogeneous or ‘out-bred’ populations. Likewise, inclusion of more regional populations, in particular those with recent African ancestry, may assist in fine mapping by offering variety in LD structure and the spectrum of causal mutations.
This work was supported by research grants from the NIH (DK075787; HL086718; HL53353).
Conflict of Interest statement. None declared.