Abstract

Copy-number variation (CNV) is the most prevalent type of structural variation in the human genome, and contributes significantly to genetic heterogeneity. It has already been recognized that some CNVs can contribute to human phenotype, including rare genomic disorders and Mendelian diseases. Other CNVs are now amenable to genome-wide association studies so that their influence on human phenotypic diversity and disease susceptibility may soon be more readily determined. Population studies and reference databases for control and disease-associated samples are required to provide an information resource about CNV frequencies and their relative contribution to phenotypic outcomes. The relatively high cost of screening individual samples has tended to limit the number of controls assayed, and use of the data has often been hampered by the variety of technology platforms and analysis techniques. As a result, there is still a paucity of data on population frequency and distribution of CNVs, particularly for those that are rare. Here, we provide an example of how to discover new CNVs from existing genotype data from large-scale genetic epidemiological studies. We also discuss the need to expand surveys of CNV in different population-based cohorts and to apply the information to studies of human variation and disease.

INTRODUCTION

Cataloguing the nature and pattern of genome variation in the general population is fundamental to understanding human phenotypic diversity. Over the past decade, the sequencing of the human genome ( 1 , 2 ), completion of the HapMap project ( 3 ) and many other initiatives ( 4 , 5 ) have greatly extended our knowledge of biallelic single nucleotide variants, as illustrated by the ∼12 million single nucleotide polymorphisms (SNPs) in public databases. In contrast, the characterization of larger (and often more complex) structural variants, such as copy-number variants (CNVs), has been more recent and has been slowed by technical challenges in performing genome-wide screening ( 6 ).

Most current knowledge of CNVs originates from a small number of studies that have annotated mainly larger (typically >50 kb) and intermediate-size structural variation (>500 bp), using analyses that span multiple experimental techniques, samples and sample sizes ( 6–8 ). Meta-analyses of such diverse studies are difficult and, as a result, the current knowledge of locations, frequencies and types of CNVs is still rudimentary ( 6 ). When combined into a non-redundant data set, the resulting genomic distribution of CNVs seems to be nonrandom and correlated with genomic features, including exons, segmental duplications and mobile elements such as Alu repeats and LINEs ( 9 , 10 ). More than 1000 genes have now been mapped within or close to regions that are affected by structural variants, and a number of disorders associated with CNVs have been described ( 7 , 11–14 ).

To date, 18.8% of the euchromatic human genome has been annotated as copy-number variable (see the Database of Genomic Variants, DGV ), but many additional control populations will need to be assessed to achieve near-saturation (at a given resolution) of a CNV map for the human genome. In this review, we discuss the need to annotate the remaining CNVs in human populations and describe some simple methods to facilitate their discovery using existing genotype datasets.

SAMPLE SOURCES FOR CNV DISCOVERY

To understand the role of CNVs in human disease, we first need insight into the prevalence of structural variants in the general population. We need to characterize large control populations, to resolve the structure, frequency, distribution and linkage disequilibrium (LD) of each variant. Uncovering the nature and patterns of CNVs in the general population will not only help to uncover the biological significance of de novo and infrequent hereditary CNVs in rare genomic disorders and Mendelian diseases ( 11 , 13 ), but also will facilitate the identification of variants that may confer increased risk for common diseases or act as modifiers of a given phenotype ( 15 , 16 ).

Some initial surveys of structural variation ( 17–20 ) have selected the HapMap sample set, composed of four populations with African, Asian and European ancestry. The rationale for using these samples was several-fold: (i) consent for genome-wide variation studies and full data release had been given, (ii) ancestral geography of donors was known and (iii) these samples have now been genotyped for over 3.6 M SNPs (HapMap phase II), making it possible to correlate structural and SNP variation. However, when interpreting the associated structural variation, consideration must also be given to phenotype, size and diversity of the populations.

First, although these samples are well characterized, no medical information was obtained, meaning that structural variation ascertained from them is not necessarily benign or neutral. Second, because the frequency of CNVs in the general population is not yet clear (mainly due to technical limitations) and the HapMap collection was intended to sample common SNPs and haplotypes, it is not known how well these samples reflect copy-number variability in the human genome. Third, structural variation may in turn vary among different populations to an extent yet unknown, and the sample size may be too small to discover the copy-number variation that represents each of the four populations. Finally, since the main source of the DNA for HapMap samples is transformed lymphoblastoid cell lines, it may be impossible to assess true germline CNV. We need to fully characterize the HapMap and related repositories with respect to CNVs, and to extend these efforts to other population controls that have more defined clinical information and larger sample sizes.

ASSESSING CNV IN A POPULATION-BASED CONTROL SAMPLE

Several genome-wide SNP association and case-control studies, each involving up to thousands of controls, have already been completed. Most of these datasets, however, have not yet been mined for CNVs. They are, therefore, a resource for discovering novel variants and gathering data on their frequency.

CNVs can be identified either by examining probe intensity differences ( 18 , 21 ) or by detecting deviations from Hardy–Weinberg equilibrium ( 17 ) or from expected Mendelian transmission ( 17 , 19 ). The probe intensity approach allows detection of both CNV gains and losses, whereas the Hardy–Weinberg and Mendelian transmission approaches identify only CNV losses, not gains. A limiting factor in all of these types of analyses is that previous generations of SNP arrays were optimized for allelic discrimination rather than copy-number measurement. Therefore, high-confidence CNV discovery has been mainly restricted to large and simple biallelic variants.
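
As an illustration of the Hardy–Weinberg approach, the following minimal sketch tests a single SNP for a deficit of heterozygotes, which is the signature expected when hemizygous deletion carriers are miscalled as homozygotes. It is not the procedure used in the cited studies: the function names, the one-degree-of-freedom chi-square test and the 3.841 cutoff (nominal P = 0.05) are assumptions made for the example, and a real analysis would use a much stricter threshold and would require runs of consecutive deviating SNPs.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic for deviation from Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expectation (monomorphic SNPs).
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


def suggests_deletion(n_aa, n_ab, n_bb, threshold=3.841):
    """True if the SNP deviates from HWE with fewer heterozygotes than expected.

    The threshold is an illustrative assumption (chi-square, 1 d.f., P = 0.05).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    expected_het = 2 * n * p * (1.0 - p)
    return hwe_chi_square(n_aa, n_ab, n_bb) > threshold and n_ab < expected_het


if __name__ == "__main__":
    print(suggests_deletion(25, 50, 25))   # genotypes in HWE proportions
    print(suggests_deletion(45, 10, 45))   # strong heterozygote deficit
```

Because a deviation of this kind can only arise from apparent loss of heterozygosity, the sketch is consistent with the point made above that this class of method detects losses but not gains.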

Commonly used CNV calling algorithms differ both in the number of samples used as reference (either one or a pool of references) and in their calling criteria. As a result, they vary in the number and size of CNVs called. In fact, relatively few CNVs are shared among the studies carried out to date. The proportion of overlapping CNVs varies from 25% ( 22 ) (based on the size of overlap of deletions, when deletions were detected in the same samples using different methodologies), to 45% ( 25 ) (when assessing the occurrence of any CNV overlap between two surveys). A CNV region (CNVR) is an artificial grouping of CNVs overlapping or in close proximity to each other ( 18 , 26 ). Without standard samples ( 26 ) and independent validation of identified CNVRs (e.g. by quantitative PCR), it is difficult to assess whether this variation is due to differences in selectivity and sensitivity of these analyses, or whether it reflects different abilities to recognize specific classes of CNVs.

With these caveats in mind, we present a simple paradigm for rapid and reliable CNV discovery, based on probe intensity differences, using existing genotype data from population-based control cohorts. For this review, we tested for CNVs in a population-based sample from the PopGen project ( 27 ), which consisted of 506 unrelated healthy individuals from Northern Schleswig-Holstein (Northern Germany) genotyped on the Affymetrix 500K SNP array. Our strategy for CNV discovery in this sample is outlined in Figure  1 . Our experiments were designed to provide the most reliable CNV calls by minimizing the likelihood of false positive detection. Such a high level of stringency is essential because it is often not possible (for reasons of consent) to obtain the original DNA or cell lines for validation.

Figure 1.

Flowchart for CNV calling from Affymetrix 500 K SNP array set data. Stepwise procedures to minimize false positive detection of CNVs. The combined full CNV dataset comprises all CNVs detected in a sample by any of the three algorithms, dChip ( 28 ), CNAG ( 29 ) and GEMCA ( 30 ). The combined stringent CNV dataset (for the present study) includes only those CNVs that fulfill the criteria of step ‘a’ as applied to both PopGen (Figs 2 and 3 ) and HapMap datasets (Fig. 3 ). Steps ‘b’ and ‘c’ represent approaches that have been used in previous studies.

Figure 2.

Comparison between CNV studies. Venn-diagram illustrating the degree of overlap in CNV regions (CNVRs) discovered in 500 PopGen unrelated control individuals and those catalogued in the Database of Genomic Variants (DGV). PopGen controls: healthy population-based controls, recruitment area of Northern Schleswig-Holstein, Northern Germany ( 27 ). For comparison purposes, both DGV and PopGen CNVRs were mapped to the human genome assembly hg17 (NCBI build 35). Overlapping CNVs identified in different PopGen samples were merged into a non-redundant set of CNVRs with borders determined by the most distal overlapping CNVs. A similar procedure was used to generate a non-redundant set of DGV CNVRs, although we note that DGV entries were annotated using different technology platforms and data processing algorithms, with different degrees of experimental standardization and validation. A PopGen-CNVR was considered novel if at least 80% of the total length did not overlap a DGV–CNVR. Using these criteria, 61% of CNVRs detected in PopGen controls were considered novel. DGV contains 6559 entries (mainly CNVs) derived from 40 publications [version June 2007]. Numbers in brackets correspond to CNVRs derived from CNVs detected by at least two algorithms (i.e. Stringent CNVR Dataset).

Figure 3.

Direct comparison of two CNV surveys using the same SNP array platform and CNV calling algorithms. ( A ) Venn-diagram illustrating the degree of overlap in CNVRs discovered in different HapMap populations. CNVs were derived from 269 HapMap samples ( 3 ) using Affymetrix 500 K SNP array sets and any of three CNV calling algorithms, dChip ( 28 ), CNAG ( 29 ) and GEMCA ( 30 ). Variants spanning centromeres were split if the number of consecutive SNP probes in either flank was > 3 and the derived variants were >1 kb. Immunoglobulin loci were masked before calling CNVs and calls were filtered for cell line-specific artifacts, as described in Redon et al. ( 18 ). Overlapping CNVs detected within each of the three HapMap continental populations (CEU: European, JPT+CHB: Asian, YRI: African) were merged into a non-redundant set of distinct CNVRs as described in Figure 2 . Numbers in brackets correspond to the stringent CNVR dataset for each population group (Fig. 2 ). Full and Stringent CNVR datasets were used for subsequent analyses. CEU: 90 Utah residents with ancestry from Northern and Western Europe, including 30 trios; YRI: 90 Yoruba from Ibadan, Nigeria, including 30 trios; CHB+JPT: 89 Han Chinese from Beijing, China and Japanese from Tokyo, Japan. ( B ) Triangle plot of inferred population structure for PopGen controls. PopGen individuals (Red) were clustered without regard to their geographical origin using STRUCTURE ( 23 , 24 ) and 780 unlinked SNPs, assuming three ancestral populations. Genotypes from the 209 unrelated HapMap individuals were used as reference in the same clustering, and colored according to their continental origin: Green, European (60 unrelated CEU); Blue, African (60 unrelated YRI); Yellow, Asian (89 unrelated CHB+JPT). The three PopGen individuals outside the HapMap clusters (coefficient of ancestry < 0.90) were removed from further analyses. ( C ) Venn-diagram of CNVR overlap between HapMap populations and PopGen controls. Non-redundant and stringent CNVR datasets were calculated as in (A).

We selected three commonly used methods developed to extract CNV information from Affymetrix SNP array data, dChip ( 28 ), CNAG ( 29 ) and GEMCA ( 30 ), and applied the criterion that a CNV needs to be recognized independently by at least two methods. This approach identifies CNVs with a greater degree of confidence, although there is a concomitant loss of power to detect novel CNVs. Improved methodologies that are validated using standard samples will be required. Nonetheless, our analysis shows that the potential for discovering novel CNVRs remains high. With all three analysis methods, we found a total of 1023 regions of apparent copy-number differences (full dataset) (Fig. 2 ), of which 430 high-confidence CNVRs (average 369 kb in length; median 185 kb) remained after applying the stringent criteria (Stringent dataset). The frequency of copy-number gains in these CNVRs was 2.3-fold greater than the occurrence of deletions, with a total of 1083 gains (average size 478 kb, median 230 kb) versus 466 deletions (average 451 kb, median 197 kb) detected in 500 individuals.
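
The two-out-of-three criterion can be sketched as follows. This is an illustration only: the `Call` record, the function names and the rule that two calls support each other when they share sample, chromosome and direction and overlap by at least one base are assumptions, since the exact matching rule of step ‘a’ in Figure 1 is not spelled out in the text.

```python
from collections import namedtuple

# Hypothetical record for one CNV call; the field names are illustrative.
Call = namedtuple("Call", "sample chrom start end kind algorithm")


def supports(a, b):
    """True if two calls describe overlapping events of the same type in one sample."""
    return (a.sample == b.sample and a.chrom == b.chrom and a.kind == b.kind
            and a.start <= b.end and b.start <= a.end)


def stringent_calls(calls, min_algorithms=2):
    """Keep calls that are supported by at least `min_algorithms` distinct algorithms."""
    kept = []
    for call in calls:
        # A call always supports itself, so its own algorithm is counted once.
        algorithms = {other.algorithm for other in calls if supports(call, other)}
        if len(algorithms) >= min_algorithms:
            kept.append(call)
    return kept


if __name__ == "__main__":
    full_dataset = [
        Call("S1", "chr1", 1000, 5000, "gain", "dChip"),
        Call("S1", "chr1", 4000, 9000, "gain", "CNAG"),
        Call("S1", "chr2", 100, 900, "loss", "GEMCA"),   # seen by one method only
    ]
    for call in stringent_calls(full_dataset):
        print(call)
```

The quadratic comparison is adequate for a sketch; with thousands of calls per cohort, sorting calls by sample and position first would be the more practical design.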

A direct comparison using the full dataset reveals that at least 61% of detected PopGen CNVRs (the criteria for defining CNVRs are described in Fig. 2 ) are not present in the DGV (Fig. 2 ), which serves as the main repository for DNA structural variation ( 31 ). Even when restricting our analysis to PopGen CNVs discovered by at least two different algorithms (stringent dataset) (Fig. 2 ), we still find 50% unique CNVRs. These findings reflect the limited sample sizes and hint at the presence of many more undiscovered structural variants in the genome.
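
The CNVR construction and novelty rule given in the legend of Figure 2 (overlapping CNVs merged to their most distal borders; a CNVR called novel if at least 80% of its length lies outside any reference CNVR) can be sketched as below. The function names and the half-open `(chrom, start, end)` coordinate convention are assumptions made for the example, and the reference set is assumed to have been merged already so that overlapping reference entries are not counted twice.

```python
def merge_into_cnvrs(intervals):
    """Merge overlapping (chrom, start, end) intervals into non-redundant regions."""
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Extend the current region to the most distal border seen so far.
            last_chrom, last_start, last_end = merged[-1]
            merged[-1] = (last_chrom, last_start, max(last_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


def covered_length(region, reference_cnvrs):
    """Number of bases of `region` that fall inside the (merged) reference regions."""
    chrom, start, end = region
    total = 0
    for ref_chrom, ref_start, ref_end in reference_cnvrs:
        if ref_chrom == chrom:
            total += max(0, min(end, ref_end) - max(start, ref_start))
    return total


def is_novel(region, reference_cnvrs, min_fraction_uncovered=0.8):
    """True if at least 80% (by default) of the region is absent from the reference."""
    length = region[2] - region[1]
    if length <= 0:
        raise ValueError("region must have positive length")
    uncovered = length - covered_length(region, reference_cnvrs)
    return uncovered / length >= min_fraction_uncovered


if __name__ == "__main__":
    cohort = merge_into_cnvrs([("chr1", 100, 200), ("chr1", 150, 300), ("chr1", 400, 500)])
    reference = merge_into_cnvrs([("chr1", 280, 450)])
    for region in cohort:
        print(region, is_novel(region, reference))
```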

A potential drawback of these comparisons is that differences in technology platforms and analysis techniques that were used to detect the CNVs reported in the DGV can confound results. To address this issue, we applied the strategy for CNV discovery outlined above to a second collection consisting of 270 HapMap samples and compared the degree of overlap in CNVRs discovered in the PopGen sample and HapMap populations. Both CNV surveys were performed on the same SNP array platform, and data were processed using the same single analysis pipeline (Fig.  1 ). In line with what is observed for the DGV (Fig.  2 ), only 22% of PopGen CNVRs were found to overlap HapMap CNVRs, and consistent figures were observed with the stringent set (26%). The CNVs identified using this pipeline provide one possible basis for developing a first comprehensive map of human CNVs in the PopGen set. In addition, these findings indicate that the low degree of overlap is not just due to technological bias, but reflects sampling limitations.

A second potential confounder in the analysis could be differences in the population distribution of CNVs. Evidence from SNP studies indicates that the allelic frequency of many SNPs can differ substantially among ethnic groups ( 32 ). After ensuring that samples to be compared are matched for continental ancestry (Fig. 3 B) and sub-dividing the HapMap CNVs by population ancestry (i.e. HapMap-European versus HapMap-others, the latter comprising HapMap-African+Asian), there was still considerable disagreement (Fig. 3 C), with only 57% overlap between PopGen and HapMap-European CNVRs, and 30% overlap between PopGen and HapMap-others.

Contrary to our expectations, there was no significant difference in the fraction of unique CNVRs among the three continental HapMap sample groups (Fig. 3 A). Based on the out-of-Africa theory ( 33 ) of modern human origin, higher levels of genetic variation should exist in the more ancient African populations and less diversity in the younger, non-African populations, which is supported by SNP diversity studies ( 34 ). Accordingly, a higher number of distinct CNVs/CNVRs within samples of African ancestry would also be expected. It is currently unclear whether the absence of such a difference is due to the limited sample size or related to the distribution of CNVRs among populations. These observations further underscore the need to genotype large samples to determine CNV and CNVR frequencies and distribution across different populations.

FUTURE CONSIDERATIONS

Next-Generation Platforms

The next generation of SNP arrays (e.g. the commercial Affymetrix 5.0 and 6.0, and Illumina 1 M) has been designed to offer the potential to simultaneously interrogate SNPs and CNVs in a single experiment. Besides SNP probes, these arrays include dedicated non-polymorphic probes for CNV detection. In addition, new probes have been added for higher coverage and probe quality has been improved, with a consequent reduction of the number of probes per polymorphic site. The advantage of using these hybrid SNP–CNV arrays is that they allow integrated association studies in which SNPs and CNVs are considered together. They also enable a complementary analysis, where CNVs detected by differences in hybridization intensity can be cross-validated by looking at deviations from Hardy–Weinberg equilibrium in nearby SNPs. To benefit fully from these improvements, it is critical to adapt traditional statistics and develop an effective set of algorithms and analysis tools.

Linkage Disequilibrium and Haplotypes

The combined analysis of SNPs and CNVs will also make it possible to evaluate the feasibility of LD approaches and the patterns of LD between CNVs and SNPs. The extent of LD between SNPs and the more commonly occurring CNVs (also called CNPs if they are found in > 1% of the population) is so far unclear. A number of studies assessing deletion polymorphisms in combination with SNP genotyping have indeed indicated that the identified deletions were ancestral and could be tagged by flanking SNPs ( 17 , 35 , 36 ). Non-recurrent structural variants are expected to show LD with SNPs on their original haplotype, as they arise only once and are likely subjected to the same selective pressure. We speculate that the common CNVs identified in these studies were subjected to mutation rates of a similar order of magnitude (∼10⁻⁸ per generation per site) to that estimated for SNPs ( 37 ). If so, then the presence or absence of many structural variants may be inferred by typing selected tagging SNPs. Another study ( 18 ), however, using fewer than a hundred genotyped CNPs, found that while there can be apparent LD between CNPs and SNPs, CNPs are less well tagged than frequency-matched SNPs. In this case, the lack of LD with nearby SNPs indicates that structural variants may have arisen repeatedly.
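
How well a SNP tags a simple biallelic CNP is usually summarized by the squared correlation r² between the two loci. The sketch below computes r² from phased haplotypes; the function name and the 0/1 coding (1 = deletion allele at the CNP, 1 = a chosen allele at the SNP) are assumptions made for the example, and real data would first require phasing and accurate CNP genotypes.

```python
def r_squared(haplotypes):
    """r^2 between a biallelic CNP and a SNP.

    `haplotypes` is a list of (cnp_allele, snp_allele) pairs, each coded 0 or 1,
    with one pair per phased chromosome.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("no haplotypes supplied")
    p_cnp = sum(cnp for cnp, _ in haplotypes) / n
    p_snp = sum(snp for _, snp in haplotypes) / n
    p_both = sum(1 for cnp, snp in haplotypes if cnp == 1 and snp == 1) / n
    d = p_both - p_cnp * p_snp                      # disequilibrium coefficient D
    denominator = p_cnp * (1 - p_cnp) * p_snp * (1 - p_snp)
    if denominator == 0:
        raise ValueError("r^2 is undefined when either locus is monomorphic")
    return d * d / denominator


if __name__ == "__main__":
    # Deletion always found on the same SNP allele: perfectly tagged.
    print(r_squared([(1, 1)] * 3 + [(0, 0)] * 7))
    # Deletion found equally often on both SNP alleles: not tagged.
    print(r_squared([(1, 1), (1, 0), (0, 1), (0, 0)]))
```

A deletion that arose once on a single haplotype background would give values near 1 with at least one flanking SNP, whereas a recurrent variant would be spread over several backgrounds and give low r² with every nearby SNP.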

When focusing on CNVs in regions rich in segmental duplication, a recent study showed that copy-number measurements from such regions were less well captured ( 20 ). Because many CNVs seem to be located in regions of complex genomic structure, this currently limits the extent to which these (either multi-allelic or complex loss-gain) variants can be genotyped using tagging SNPs. In fact, while nearby SNPs could potentially be combined into a more informative haplotype to be used as a proxy for simple CNV events, the same may not apply to the more complex CNVs, whose junctions are embedded within large duplication structures and whose mutation rates can be as high as 10⁻⁴ [as estimated for structural variants associated with rare sporadic genomic disorders ( 13 )].

A multi-stage approach is needed to fully assess LD around CNPs and to determine which CNVs may be found by LD mapping: (i) accurately genotype large and representative CNPs in HapMap samples, or other reference samples with dense information on SNP genotypes; (ii) select tags (SNPs and/or well-characterized CNPs) to act as proxies of the surrounding CNPs; (iii) evaluate the transferability and performance of those tags for other populations outside the HapMap project, which will broaden the applicability of HapMap findings.

SUMMARY

We present general methods to generate a population-based resource of CNVs derived from a homogeneous healthy control population, in this case originating from Northern Germany. In total, 430 distinct high-confidence CNVRs were detected (non-stringent: 1023), spanning approximately 158 Mb (or 4.8% of the genome). Of these, at least 50% had not been reported (at the time of this study) in the Database of Genomic Variants and at least 73% were not detected among the HapMap samples. Of the CNVRs discovered, 95.6% are rare (<2% frequency) and only 4.4% of the CNVRs were common changes (≥2%). We could not validate our primary data with other technologies because there was no consent to distribute the samples for this purpose. This will likely also be the case when examining many other similar datasets. The highly stringent calling criteria we applied afford confidence that the majority of the CNVs detected are bona fide variants. Our unpublished data (C.M. Marshall et al. , in preparation) indicate that, depending on the quality of the underlying experimental data, validation of CNVs detected using the analysis pipeline described here can approach 100% accuracy. Therefore, regardless of the genotyping platform used, other datasets could be tested for CNVs in a similar manner and the data presented for general use to the scientific community. Another study recently detected germ-line CNVs in a similarly large North American population ( 25 ).

Studies of diverse large-sized control cohorts such as the PopGen collection described here will ensure that CNVs in databases better represent those observed in the general population. Further technology developments may be required to genotype larger, more complex structural variation and to systematically distinguish corresponding haplotypes across ethnic groups with different ancestry. Such efforts are a necessary prerequisite to allow structural variation to become better integrated with the existing SNP-based LD maps. Ultimately, this will help assess the significance of this form of variation in its association with phenotype and disease.

ELECTRONIC DATABASE INFORMATION

Database of Genomic Variants: ( http://projects.tcag.ca/variation/ ).

Database of Chromosomal Imbalance and Phenotype in Humans using Ensembl Resources (DECIPHER): ( http://www.sanger.ac.uk/PostGenomics/decipher/ ).

FUNDING

Genome Canada/Ontario Genomics Institute (to S.W.S.), Canadian Institutes of Health Research (CIHR) (to S.W.S.), the GlaxoSmithKline/CIHR Pathfinder Chair in Genetics and Genomics at SickKids and the University of Toronto (to S.W.S.), The Royal Netherlands Academy of Arts and Sciences (Ter Meulen Funds fellowship TMF/DA/5801 to D.P.), Netherlands Organization for Scientific Research (Rubicon fellowship 2007/02470/ALW to D.P.), SickKids Foundation (to C.R.M.) and National Alliance for Research on Schizophrenia and Depression (NARSAD) (to C.R.M.).

Conflict of Interest statement. None declared.

ACKNOWLEDGEMENTS

We thank Stefan Schreiber and Andreas Fiebig from the Institute for Clinical Molecular Biology, Christian Albrechts University, Kiel, Germany, for providing Affymetrix 500 K microarray raw intensity and genotyping data from a sample set of the PopGen project. We thank the Centre for Applied Genomics, Hospital for Sick Children.

REFERENCES

1. Lander E.S., Linton L.M., Birren B., Nusbaum C., Zody M.C., Baldwin J., Devon K., Dewar K., Doyle M., FitzHugh W., et al. Initial sequencing and analysis of the human genome. Nature, 2001, vol. 409 (pg. 860-921)
2. Venter J.C., Adams M.D., Myers E.W., Li P.W., Mural R.J., Sutton G.G., Smith H.O., Yandell M., Evans C.A., Holt R.A., et al. The sequence of the human genome. Science, 2001, vol. 291 (pg. 1304-1351)
3. International HapMap Consortium. A haplotype map of the human genome. Nature, 2005, vol. 437 (pg. 1299-1320)
4. Hinds D.A., Stuve L.L., Nilsen G.B., Halperin E., Eskin E., Ballinger D.G., Frazer K.A., Cox D.R. Whole-genome patterns of common DNA variation in three human populations. Science, 2005, vol. 307 (pg. 1072-1079)
5. Sachidanandam R., Weissman D., Schmidt S.C., Kakol J.M., Stein L.D., Marth G., Sherry S., Mullikin J.C., Mortimore B.J., Willey D.L., et al. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature, 2001, vol. 409 (pg. 928-933)
6. Carter N.P. Methods and strategies for analyzing copy number variation using DNA microarrays. Nat. Genet., 2007, vol. 39 (pg. S16-S21)
7. Feuk L., Carson A.R., Scherer S.W. Structural variation in the human genome. Nat. Rev. Genet., 2006, vol. 7 (pg. 85-97)
8. Freeman J.L., Perry G.H., Feuk L., Redon R., McCarroll S.A., Altshuler D.M., Aburatani H., Jones K.W., Tyler-Smith C., Hurles M.E., et al. Copy number variation: new insights in genome diversity. Genome Res., 2006, vol. 16 (pg. 949-961)
9. Cooper G.M., Nickerson D.A., Eichler E.E. Mutational and selective effects on copy-number variants in the human genome. Nat. Genet., 2007, vol. 39 (pg. S22-S29)
10. Conrad D.F., Hurles M. The population genetics of structural variation. Nat. Genet., 2007, vol. 39 (pg. S30-S36)
11. Feuk L., Marshall C.R., Wintle R.F., Scherer S.W. Structural variants: changing the landscape of chromosomes and design of disease studies. Hum. Mol. Genet., 2006, vol. 15 Spec No. 1 (pg. R57-R66)
12. Lee J.A., Lupski J.R. Genomic rearrangements and gene copy-number alterations as a cause of nervous system disorders. Neuron, 2006, vol. 52 (pg. 103-121)
13. Lupski J. Genomic rearrangements and sporadic disease. Nat. Genet., 2007, vol. 39 (pg. S43-S47)
14. Lee C., Iafrate A.J., Brothman A.R. Copy number variation and clinical cytogenetics diagnosis of constitutional disorders. Nat. Genet., 2007, vol. 39 (pg. S48-S54)
15. McCarroll S.A., Altshuler D. Copy number variation and association studies of human disease. Nat. Genet., 2007, vol. 39 (pg. S37-S42)
16. Beckmann J.S., Estivill X., Antonarakis S.E. Copy number variants and genetic traits: closer to the resolution of phenotypic to genotypic variability. Nat. Rev. Genet., 2007, vol. 8 (pg. 639-646)
17. McCarroll S.A., Hadnott T.N., Perry G.H., Sabeti P.C., Zody M.C., Barrett J.C., Dallaire S., Gabriel S.B., Lee C., Daly M.J., et al. Common deletion polymorphisms in the human genome. Nat. Genet., 2006, vol. 38 (pg. 86-92)
18. Redon R., Ishikawa S., Fitch K.R., Feuk L., Perry G.H., Andrews T.D., Fiegler H., Shapero M.H., Carson A.R., Chen W., et al. Global variation in copy number in the human genome. Nature, 2006, vol. 444 (pg. 444-454)
19. Conrad D.F., Andrews T.D., Carter N.P., Hurles M.E., Pritchard J.K. A high-resolution survey of deletion polymorphism in the human genome. Nat. Genet., 2006, vol. 38 (pg. 75-81)
20. Locke D.P., Sharp A.J., McCarroll S.A., McGrath S.D., Newman T.L., Cheng Z., Schwartz S., Albertson D.G., Pinkel D., Altshuler D.M., et al. Linkage disequilibrium and heritability of copy-number polymorphisms within duplicated regions of the human genome. Am. J. Hum. Genet., 2006, vol. 79 (pg. 275-290)
21. Huang J., Wei W., Zhang J., Liu G., Bignell G.R., Stratton M.R., Futreal P.A., Wooster R., Jones K.W., Shapero M.H. Whole genome DNA copy number changes identified by high density oligonucleotide arrays. Hum. Genomics, 2004, vol. 1 (pg. 287-299)
22. Eichler E.E. Widening the spectrum of human genetic variation. Nat. Genet., 2006, vol. 38 (pg. 9-11)
23. Pritchard J.K., Stephens M., Donnelly P. Inference of population structure using multilocus genotype data. Genetics, 2000, vol. 155 (pg. 945-959)
24. Falush D., Stephens M., Pritchard J.K. Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics, 2003, vol. 164 (pg. 1567-1587)
25. Zogopoulos G., Ha K.C., Naqib F., Moore S., Kim H., Montpetit A., Robidoux F., Laflamme P., Cotterchio M., Greenwood C., et al. Germ-line DNA copy number variation frequencies in a large North American population. Hum. Genet., 2007, doi:10.1007/s00439-007-0404-5
26. Scherer S.W., Lee C., Birney E., Altshuler D., Eichler E.E., Carter N., Hurles M.E., Feuk L. Challenges and standards in integrating surveys of structural variation. Nat. Genet., 2007, vol. 39 (pg. S1-S54)
27. Krawczak M., Nikolaus S., von Eberstein H., Croucher P.J., El Mokhtari N.E., Schreiber S. PopGen: population-based recruitment of patients and controls for the analysis of complex genotype-phenotype relationships. Community Genet., 2006, vol. 9 (pg. 55-61)
28. Leykin I., Hao K., Cheng J., Meyer N., Pollak M.R., Smith R.J., Wong W.H., Rosenow C., Li C. Comparative linkage analysis and visualization of high-density oligonucleotide SNP array data. BMC Genet., 2005, vol. 6 (pg. 7)
29. Nannya Y., Sanada M., Nakazaki K., Hosoya N., Wang L., Hangaishi A., Kurokawa M., Chiba S., Bailey D.K., Kennedy G.C., et al. A robust algorithm for copy number detection using high-density oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Res., 2005, vol. 65 (pg. 6071-6079)
30. Komura D., Shen F., Ishikawa S., Fitch K.R., Chen W., Zhang J., Liu G., Ihara S., Nakamura H., Hurles M.E., et al. Genome-wide detection of human copy number variations using high-density DNA oligonucleotide arrays. Genome Res., 2006, vol. 16 (pg. 1575-1584)
31. Zhang J., Feuk L., Duggan G.E., Khaja R., Scherer S.W. Development of bioinformatics resources for display and analysis of copy number and other structural variants in the human genome. Cytogenet. Genome Res., 2006, vol. 115 (pg. 205-214)
32. Stephens J.C., Schneider J.A., Tanguay D.A., Choi J., Acharya T., Stanley S.E., Jiang R., Messer C.J., Chew A., Han J.H., et al. Haplotype variation and linkage disequilibrium in 313 human genes. Science, 2001, vol. 293 (pg. 489-493)
33. Vigilant L., Stoneking M., Harpending H., Hawkes K., Wilson A.C. African populations and the evolution of human mitochondrial DNA. Science, 1991, vol. 253 (pg. 1503-1507)
34. Tishkoff S.A., Verrelli B.C. Patterns of human genetic diversity: implications for human evolutionary history and disease. Annu. Rev. Genomics Hum. Genet., 2003, vol. 4 (pg. 293-340)
35. Hinds D.A., Kloek A.P., Jen M., Chen X., Frazer K.A. Common deletions and SNPs are in linkage disequilibrium in the human genome. Nat. Genet., 2006, vol. 38 (pg. 82-85)
36. Newman T.L., Rieder M.J., Morrison V.A., Sharp A.J., Smith J.D., Sprague L.J., Kaul R., Carlson C.S., Olson M.V., Nickerson D.A., et al. High-throughput genotyping of intermediate-size structural variation. Hum. Mol. Genet., 2006, vol. 15 (pg. 1159-1167)
37. Nachman M.W., Crowell S.L. Estimate of the mutation rate per nucleotide in humans. Genetics, 2000, vol. 156 (pg. 297-304)

Author notes

The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.