Copy-number variation (CNV) is the most prevalent type of structural variation in the human genome, and contributes significantly to genetic heterogeneity. It has already been recognized that some CNVs can contribute to human phenotype, including rare genomic disorders and Mendelian diseases. Other CNVs are now amenable to genome-wide association studies so that their influence on human phenotypic diversity and disease susceptibility may soon be more readily determined. Population studies and reference databases for control and disease-associated samples are required to provide an information resource about CNV frequencies and their relative contribution to phenotypic outcomes. The relatively high cost of screening individual samples has tended to limit the number of controls assayed, and use of the data has often been hampered by the variety of technology platforms and analysis techniques. As a result, there is still a paucity of data on population frequency and distribution of CNVs, particularly for those that are rare. Here, we provide an example of how to discover new CNVs from existing genotype data from large-scale genetic epidemiological studies. We also discuss the need to expand surveys of CNV in different population-based cohorts and to apply the information to studies of human variation and disease.
Cataloguing the nature and pattern of genome variation in the general population is fundamental to understanding human phenotypic diversity. Over the past decade, the sequencing of the human genome ( 1 , 2 ), completion of the HapMap project ( 3 ) and many other initiatives ( 4 , 5 ) have greatly extended our knowledge of biallelic single nucleotide variants, as illustrated by the ∼12 million single nucleotide polymorphisms (SNPs) in public databases. In contrast, the characterization of larger (and often more complex) structural variants, such as copy-number variants (CNVs), has been more recent and has been slowed by technical challenges in performing genome-wide screening ( 6 ).
Most current knowledge of CNVs originates from a small number of studies that have annotated mainly larger (typically >50 kb) and intermediate-size (>500 bp) structural variation, using analyses that span multiple experimental techniques, samples and sample sizes ( 6–8 ). Meta-analyses of such diverse studies are difficult and, as a result, the current knowledge of locations, frequencies and types of CNVs is still rudimentary ( 6 ). When combined into a non-redundant data set, the resulting genomic distribution of CNVs seems to be nonrandom and correlated with genomic features, including exons, segmental duplications and mobile elements such as Alu repeats and LINEs ( 9 , 10 ). More than 1000 genes have now been mapped within or close to regions that are affected by structural variants, and a number of disorders associated with CNVs have been described ( 7 , 11–14 ).
To date, 18.8% of the euchromatic human genome has been annotated as copy-number variable (see the Database of Genomic Variants, DGV ), but many additional control populations will need to be assessed to achieve near-saturation (at a given resolution) of a CNV map for the human genome. In this review, we discuss the need to annotate the remaining CNVs in human populations and describe some simple methods to facilitate their discovery using existing genotype datasets.
SAMPLE SOURCES FOR CNV DISCOVERY
To understand the role of CNVs in human disease, we first need insight into the prevalence of structural variants in the general population. We need to characterize large control populations, to resolve the structure, frequency, distribution and linkage disequilibrium (LD) of each variant. Uncovering the nature and patterns of CNVs in the general population will not only help to uncover the biological significance of de novo and infrequent hereditary CNVs in rare genomic disorders and Mendelian diseases ( 11 , 13 ), but also will facilitate the identification of variants that may confer increased risk for common diseases or act as modifiers of a given phenotype ( 15 , 16 ).
Some initial surveys of structural variation ( 17–20 ) have selected the HapMap sample set, composed of four populations with African, Asian and European ancestry. The rationale for using these samples was several-fold: (i) consent for genome-wide variation studies and full data release had been given, (ii) ancestral geography of donors was known and (iii) these samples have now been genotyped for over 3.6 M SNPs (HapMap phase II), making it possible to correlate structural and SNP variation. However, when interpreting the associated structural variation, consideration must also be given to phenotype, size and diversity of the populations.
First, although these samples are well characterized, no medical information was obtained, meaning that structural variation ascertained from them is not necessarily benign or neutral. Second, because the frequency of CNVs in the general population is not yet clear (mainly due to technical limitations) and the HapMap collection was intended to sample common SNPs and haplotypes, it is not known how well these samples reflect copy-number variability in the human genome. Third, structural variation may in turn vary among different populations to an extent yet unknown, and the sample size may be too small to discover CNVs that are representative of each of the four populations. Finally, since the main source of the DNA for HapMap samples is transformed lymphoblastoid cell lines, it may be impossible to assess true germline CNV. We need to fully characterize the HapMap and related repositories with respect to CNVs, and to extend these efforts to other population controls that have more defined clinical information and larger sample sizes.
ASSESSING CNV IN A POPULATION-BASED CONTROL SAMPLE
Several genome-wide SNP association and case-control studies, each involving up to thousands of controls, have already been completed. Most of these datasets, however, have not yet been mined for CNVs. They are, therefore, a resource for discovering novel variants and gathering data on their frequency.
CNVs can be identified either by examining probe intensity differences ( 18 , 21 ) or by detecting deviations from Hardy–Weinberg equilibrium ( 17 ) or from expected Mendelian transmission ( 17 , 19 ). The probe intensity approach allows detection of both CNV gains and losses, whereas the Hardy–Weinberg and Mendelian transmission approaches identify only losses. A limiting factor in all of these types of analyses is that previous generations of SNP arrays were optimized for allelic discrimination rather than copy-number measurement. Therefore, high-confidence CNV discovery has been mainly restricted to large and simple biallelic variants.
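The Hardy–Weinberg approach can be illustrated with a minimal sketch. A hemizygous deletion makes carriers appear homozygous at SNPs inside the deleted segment, so such SNPs tend to show fewer heterozygotes than expected. The function name `hwe_chi_square` and the genotype counts used below are hypothetical and chosen only for illustration; this is not the procedure used in the cited studies, and deviations from equilibrium can also arise from genotyping error or population stratification.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg proportions for one biallelic SNP.

    Returns (chi2, het_excess), where het_excess is the observed minus the
    expected number of heterozygotes. A strongly negative het_excess is the
    pattern a hemizygous deletion would be expected to produce.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("at least one genotype is required")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expectation (monomorphic SNPs).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return chi2, n_ab - expected[1]
```

For example, `hwe_chi_square(40, 20, 40)` gives a statistic of 36 with 30 fewer heterozygotes than expected, far above the conventional 3.84 threshold for one degree of freedom at the 5% level, whereas `hwe_chi_square(25, 50, 25)` gives a statistic of 0.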
Commonly used CNV calling algorithms differ both in the number of samples used as reference (either one or a pool of references) and in their calling criteria. As a result, they vary in the number and size of CNVs called. In fact, there are relatively few CNVs shared among studies carried out to date. The proportion of overlapping CNVs varies from 25% ( 22 ) (based on the size of overlap of deletions, when deletions were detected in the same samples using different methodologies) to 45% ( 25 ) (when assessing the occurrence of any CNV overlap between two surveys). A CNV region (CNVR) is an artificial grouping of CNVs that overlap or lie in close proximity to each other ( 18 , 26 ). Without standard samples ( 26 ) and independent validation of identified CNVRs (e.g. by quantitative PCR), it is difficult to assess whether this variation is due to differences in selectivity and sensitivity of these analyses, or whether it reflects different abilities to recognize specific classes of CNVs.
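The grouping of individual CNV calls into CNVRs can be sketched as a simple interval merge. This is a minimal illustration under assumed rules: calls are `(chromosome, start, end)` tuples, and the `max_gap` parameter that stands in for "close proximity" is a hypothetical choice, since the exact criteria differ between studies.

```python
def merge_into_cnvrs(cnvs, max_gap=0):
    """Merge CNV calls into CNV regions (CNVRs).

    cnvs: iterable of (chrom, start, end) tuples pooled across samples.
    max_gap: calls separated by at most this many bases are also merged
             (an assumed proximity rule, not a published standard).
    Returns a sorted list of merged (chrom, start, end) regions.
    """
    regions = []
    for chrom, start, end in sorted(cnvs):
        if regions and regions[-1][0] == chrom and start <= regions[-1][2] + max_gap:
            # Overlaps or lies close to the current region: extend it.
            last = regions[-1]
            regions[-1] = (chrom, last[1], max(last[2], end))
        else:
            regions.append((chrom, start, end))
    return regions
```

Because a CNVR grows whenever any call touches it, a single region can span calls that do not overlap each other directly, which is one reason the construct is described as artificial.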
With these caveats in mind, we present a simple paradigm for rapid and reliable CNV discovery, based on probe intensity differences, using existing genotype data from population-based control cohorts. For this review, we tested for CNVs in a population-based sample from the PopGen project ( 27 ), which consisted of 506 unrelated healthy individuals from Northern Schleswig-Holstein (Northern Germany) genotyped on the Affymetrix 500K SNP array. Our strategy for CNV discovery in this sample is outlined in Figure 1 . Our experiments were designed to provide the most reliable CNV calls by minimizing the likelihood of false positive detection. Such a high level of stringency is essential because it is often not possible (for reasons of consent) to obtain the original DNA or cell lines for validation.
We selected three commonly used methods developed to extract CNV information from Affymetrix SNP array data, dChip ( 28 ), CNAG ( 29 ) and GEMCA ( 30 ), and applied the criterion that a CNV needs to be recognized independently by at least two methods. This approach identifies CNVs with a greater degree of confidence, although there is a concomitant loss of power to detect novel CNVs. Improved methodologies that are validated using standard samples will be required. Nonetheless, our analysis shows that the potential for discovering novel CNVRs remains high. With all three analysis methods, we found a total of 1023 regions of apparent copy-number differences (full dataset) (Fig. 2 ), of which 430 high-confidence CNVRs (average 369 kb in length; median 185 kb) remained after applying the stringent criteria (stringent dataset). The frequency of copy-number gains in these CNVRs was 2.3-fold greater than the occurrence of deletions, with a total of 1083 gains (average size 478 kb, median 230 kb) versus 466 deletions (average 451 kb, median 197 kb) detected in 500 individuals.
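The "at least two methods" criterion can be sketched as follows for the calls made in a single sample. This is an illustration under stated assumptions, not the pipeline itself: two calls are taken to support each other if they lie on the same chromosome, share at least one base and have the same direction (gain or loss). The function names and the example coordinates are hypothetical, and the method names are used only as dictionary labels.

```python
def overlaps(a, b):
    """True if calls a and b, given as (chrom, start, end, kind), share a base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def consensus_calls(calls_by_method, min_methods=2):
    """Keep calls supported by at least `min_methods` calling methods.

    calls_by_method: dict mapping a method label to a list of
                     (chrom, start, end, kind) calls for one sample,
                     where kind is "gain" or "loss".
    Returns a list of (method, call, supporting_methods) tuples.
    """
    kept = []
    for method, calls in calls_by_method.items():
        for call in calls:
            support = {method}
            for other, other_calls in calls_by_method.items():
                if other != method and any(
                    call[3] == o[3] and overlaps(call, o) for o in other_calls
                ):
                    support.add(other)
            if len(support) >= min_methods:
                kept.append((method, call, sorted(support)))
    return kept


# Invented example: one gain seen by two methods, one unsupported loss.
example = {
    "dChip": [("chr1", 100, 500, "gain"), ("chr2", 10, 20, "loss")],
    "CNAG": [("chr1", 300, 700, "gain")],
    "GEMCA": [("chr2", 15, 30, "gain")],
}
stringent = consensus_calls(example)
```

In this example only the two overlapping chr1 gains survive; the chr2 calls are dropped because the two methods disagree on direction. Requiring a minimum reciprocal overlap instead of a single shared base would be a stricter alternative.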
A direct comparison using the full dataset reveals that at least 61% of detected PopGen CNVRs (criteria for defining CNVRs are described in Fig. 2 ) are not present in the DGV (Fig. 2 ), which serves as the main repository for DNA structural variation ( 31 ). Even when restricting our analysis to PopGen CNVs discovered by at least two different algorithms (stringent dataset) (Fig. 2 ), we still find 50% unique CNVRs. These findings reflect the limited sample sizes and hint at the presence of many more undiscovered structural variants in the genome.
A potential drawback of these comparisons is that differences in technology platforms and analysis techniques that were used to detect the CNVs reported in the DGV can confound results. To address this issue, we applied the strategy for CNV discovery outlined above to a second collection consisting of 270 HapMap samples and compared the degree of overlap in CNVRs discovered in the PopGen sample and HapMap populations. Both CNV surveys were performed on the same SNP array platform, and data were processed using the same single analysis pipeline (Fig. 1 ). In line with what is observed for the DGV (Fig. 2 ), only 22% of PopGen CNVRs were found to overlap HapMap CNVRs, and consistent figures were observed with the stringent set (26%). The CNVs identified using this pipeline provide one possible basis for developing a first comprehensive map of human CNVs in the PopGen set. In addition, these findings indicate that the low degree of overlap is not just due to technological bias, but reflects sampling limitations.
A second potential confounder in the analysis could be due to differences in the population distribution of CNVs. Evidence from SNP studies indicates that the allelic frequency of many SNPs can differ substantially among ethnic groups ( 32 ). After ensuring that samples to be compared are matched for continental ancestry (Fig. 3 B) and sub-dividing the HapMap CNVs by population ancestry (i.e. HapMap-European versus HapMap-others: HapMap-African+Asian), there was still considerable disagreement (Fig. 3 C), with only 57% overlap between PopGen and HapMap-European CNVRs, and 30% overlap between PopGen and HapMap-others.
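Overlap percentages of the kind quoted in these comparisons can be computed with a short routine such as the one below. It is a minimal sketch that assumes CNVRs are `(chromosome, start, end)` tuples and counts a query CNVR as overlapping if it shares at least one base with any reference CNVR; the function name is hypothetical and the overlap rule is an assumption, not necessarily the one used for the figures.

```python
def fraction_overlapping(query, reference):
    """Fraction of query CNVRs that overlap at least one reference CNVR.

    query, reference: lists of (chrom, start, end) tuples.
    """
    if not query:
        raise ValueError("query set is empty")
    # Index the reference regions by chromosome to avoid cross-chromosome checks.
    by_chrom = {}
    for chrom, start, end in reference:
        by_chrom.setdefault(chrom, []).append((start, end))
    hits = sum(
        1
        for chrom, start, end in query
        if any(start <= r_end and r_start <= end
               for r_start, r_end in by_chrom.get(chrom, []))
    )
    return hits / len(query)
```

One minus this fraction gives the proportion of CNVRs that are unique to the query set. The linear scan per chromosome is adequate for a few thousand regions; an interval tree would be preferable for much larger catalogues.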
Contrary to our expectations, there was no significant difference in the fraction of unique CNVRs among three continental HapMap sample groups (Fig. 3 A). Based on the out-of-Africa theory ( 33 ) of modern human origin, higher levels of genetic variation should exist in the more ancient African populations and less diversity in the younger, non-African populations, which is supported by SNP diversity studies ( 34 ). Accordingly, a higher number of distinct CNVs/ CNVRs within samples of African ancestry would also be expected. It is currently unclear whether the absence of such difference is due to the limited sample size or related to the distribution of CNVRs among populations. These observations further underscore the need to genotype large samples to determine CNV and CNVR frequencies and distribution across different populations.
The next generation of SNP arrays (e.g. the commercial Affymetrix 5.0 and 6.0, and Illumina 1 M) has been designed to offer the potential to simultaneously interrogate SNPs and CNVs in a single experiment. Besides SNP probes, these arrays include dedicated non-polymorphic CNV probes for CNV detection. In addition, new probes have been added for higher coverage and probe quality has been improved, with a consequent reduction of the number of probes per polymorphic site. The advantage of using these hybrid SNP–CNV arrays is that they allow integrated association studies in which SNPs and CNV are considered together. They also enable a complementary analysis, where CNVs detected by differences in hybridization intensity can be cross-validated by looking at discrepancies in Hardy–Weinberg equilibrium in nearby SNPs. To benefit fully from these improvements, it is critical to adapt traditional statistics and develop an effective set of algorithms and analysis tools.
Linkage disequilibrium and Haplotypes
The combined analysis of SNPs and CNVs will also make it possible to assess the feasibility of LD-based approaches and to evaluate patterns of LD between CNVs and SNPs. The extent of LD between SNPs and the more commonly occurring CNVs (also called CNPs if they are found in >1% of the population) is so far unclear. A number of studies to assess deletion polymorphisms in combination with SNP genotyping have indeed indicated that the identified deletions were ancestral and could be tagged by flanking SNPs ( 17 , 35 , 36 ). Non-recurrent structural variants are expected to show LD with SNPs on their original haplotype, as they arise only once and are likely subjected to the same selective pressure. We speculate that the common CNVs identified in these studies were subjected to mutation rates of a similar order of magnitude (∼10⁻⁸ per generation per site) to that estimated for SNPs ( 37 ). If so, then the presence or absence of many structural variants may be inferred by typing selected tagging SNPs. Another study ( 18 ), however, using fewer than a hundred genotyped CNPs, found that while there can be apparent LD between CNPs and SNPs, CNPs are less well tagged than frequency-matched SNPs. In this case, the lack of LD with nearby SNPs indicates that structural variants may have arisen repeatedly.
When focusing on CNVs in regions rich in segmental duplication, a recent study showed that copy-number measurements from such regions were less well captured ( 20 ). Because many CNVs seem to be located in regions of complex genomic structure, this currently limits the extent to which these (either multi-allelic or complex loss-gain) variants can be genotyped using tagging SNPs. In fact, while nearby SNPs could potentially be combined into a more informative haplotype to be used as a proxy for simple CNV events, the same may not apply to the more complex CNVs, whose junctions are embedded within large duplication structures and whose mutation rates can be as high as 10⁻⁴ [as estimated for structural variants associated with rare sporadic genomic disorders ( 13 )].
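How well a SNP tags a simple deletion can be quantified, as a first approximation, by the squared correlation between SNP genotypes and copy number across individuals. The sketch below assumes unphased data, with SNP genotypes coded as the count of one allele (0, 1 or 2) and the deletion coded as the number of deleted copies (0, 1 or 2). The function name is hypothetical, and the genotype-correlation r² used here is a simple stand-in for haplotype-based LD measures.

```python
def genotype_r2(snp, cnv):
    """Squared Pearson correlation between two per-individual genotype vectors.

    snp: allele counts (0, 1, 2) at a SNP, one value per individual.
    cnv: deleted-copy counts (0, 1, 2) at a CNV for the same individuals.
    Returns a value between 0 (no tagging) and 1 (perfect tagging).
    """
    n = len(snp)
    if n != len(cnv) or n < 2:
        raise ValueError("need two equally long vectors with at least 2 entries")
    mean_s = sum(snp) / n
    mean_c = sum(cnv) / n
    ss = sum((s - mean_s) ** 2 for s in snp)
    cc = sum((c - mean_c) ** 2 for c in cnv)
    sc = sum((s - mean_s) * (c - mean_c) for s, c in zip(snp, cnv))
    if ss == 0 or cc == 0:
        return 0.0  # a monomorphic marker carries no tagging information
    return (sc * sc) / (ss * cc)
```

A deletion that arose once on a single haplotype would be expected to reach values near 1 with some flanking SNP, whereas a recurrent or multi-allelic CNV would show low values with all nearby SNPs.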
A multi-stage approach is needed to fully assess LD around CNPs and to determine which CNVs may be found by LD mapping: (i) accurately genotype large and representative CNPs in HapMap samples, or other reference samples with dense information on SNP genotypes; (ii) select tags (SNPs and/or well-characterized CNPs) to act as proxies of the surrounding CNPs; (iii) evaluate the transferability and performance of those tags for other populations outside the HapMap project, which will broaden the applicability of HapMap findings.
We present general methods to generate a population-based resource of CNVs derived from a homogeneous healthy control population, in this case originating from Northern Germany. In total, 430 distinct high-confidence CNVRs were detected (non-stringent: 1023), spanning approximately 158 Mb (or 4.8% of the genome). Of these, at least 50% had not been reported (at the time of this study) in the Database of Genomic Variants and at least 73% were not detected among the HapMap samples. Of the CNVRs discovered, 95.6% are rare (<2% frequency) and only 4.4% of the CNVRs were common changes (≥2%). We could not validate our primary data with other technologies because there was no consent to distribute the samples for this purpose. This will likely also be the case when examining many other similar datasets. The highly stringent calling criteria we applied afford confidence that the majority of the CNVs detected are bona fide variants. Our unpublished data (C.M. Marshall et al. , in preparation) indicate that, depending on the quality of the underlying experimental data, validation of CNVs detected using the analysis pipeline described here can approach 100% accuracy. Therefore, regardless of the genotyping platform used, other datasets could be tested for CNVs in a similar manner and the data presented for general use to the scientific community. Another study recently detected germ-line CNVs in a similarly large North American population ( 25 ).
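The split into rare and common CNVRs follows directly from the 2% frequency cut-off. A minimal sketch, assuming that frequency is measured as the fraction of individuals carrying a call in the region (the function name, region labels and counts are hypothetical):

```python
def classify_by_frequency(carrier_counts, n_samples, common_threshold=0.02):
    """Label each CNVR as "rare" or "common" from its carrier frequency.

    carrier_counts: dict mapping a CNVR identifier to the number of
                    individuals carrying a call in that region.
    n_samples: total number of individuals screened.
    common_threshold: frequency at or above which a CNVR counts as common.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    labels = {}
    for region, carriers in carrier_counts.items():
        frequency = carriers / n_samples
        labels[region] = "common" if frequency >= common_threshold else "rare"
    return labels
```

With 500 individuals, for instance, a CNVR seen in 3 individuals (0.6%) is labelled rare and one seen in 25 individuals (5%) is labelled common.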
Studies of large and diverse control cohorts, such as the PopGen collection described here, will ensure that CNVs in databases better represent those observed in the general population. Further technology developments may be required to genotype larger, more complex structural variation and to systematically distinguish corresponding haplotypes across ethnic groups with different ancestry. Such efforts are a necessary prerequisite to allow structural variation to become better integrated with the existing SNP-based LD maps. Ultimately, this will help assess the significance of this form of variation in its association with phenotype and disease.
ELECTRONIC DATABASE INFORMATION
Database of Genomic Variants: ( http://projects.tcag.ca/variation/ ).
Database of Chromosomal Imbalance and Phenotype in Humans using Ensembl Resources (DECIPHER): ( http://www.sanger.ac.uk/PostGenomics/decipher/ ).
Genome Canada/Ontario Genomics Institute (to S.W.S.), Canadian Institutes of Health Research (CIHR) (to S.W.S.), the GlaxoSmithKline/CIHR Pathfinder Chair in Genetics and Genomics at SickKids and the University of Toronto (to S.W.S.), The Royal Netherlands Academy of Arts and Sciences (Ter Meulen Funds fellowship TMF/DA/5801 to D.P.), Netherlands Organization for Scientific Research (Rubicon fellowship 2007/02470/ALW to D.P.), SickKids Foundation (to C.R.M.), National Alliance for Research on Schizophrenia and Depression (NARSAD) (to C.R.M.).
Conflict of Interest statement . None declared.
We thank Stefan Schreiber and Andreas Fiebig from the Institute for Clinical Molecular Biology, Christian Albrechts University, Kiel, Germany, for providing Affymetrix 500 K microarray raw intensity and genotyping data from a sample set of the PopGen project. We thank the Centre for Applied Genomics, Hospital for Sick Children.