An exome sequencing pipeline for identifying and genotyping common CNVs associated with disease with application to psoriasis

Motivation: Despite the prevalence of copy number variation (CNV) in the human genome, only a handful of confirmed associations have been reported between common CNVs and complex disease. This may be partially attributed to the difficulty in accurately genotyping CNVs in large cohorts using array-based technologies. Exome sequencing is now widely being applied to case–control cohorts and presents an exciting opportunity to look for common CNVs associated with disease. Results: We developed ExoCNVTest: an exome sequencing analysis pipeline to identify disease-associated CNVs and to generate absolute copy number genotypes at putatively associated loci. Our method re-discovered the LCE3B_LCE3C CNV association with psoriasis (P-value = 5 × 10e−6) while controlling inflation of test statistics (λ < 1). ExoCNVTest-derived absolute CNV genotypes were 97.4% concordant with PCR-derived genotypes at this locus. Availability and implementation: ExoCNVTest has been implemented in Java and R and is freely available from www1.imperial.ac.uk/medicine/people/l.coin/. Contact: wangj@genomics.org.cn or Lachlan.J.M.Coin@genomics.org.cn


INTRODUCTION
Common copy number variation (CNV) in the human genome has been extensively catalogued using both array comparative hybridization (aCGH) as well as more recently with next-generation whole genome sequencing (Conrad et al., 2010;Mills et al., 2011;Redon et al., 2006). Studies have demonstrated that common CNV is prevalent in genic as well as non-genic regions of the genome and accounts for ∼18% of detected genetic variation in gene expression (Stranger et al., 2007). While common copy number variants associated with complex disease have been identified (Aitman et al., 2006;de Cid et al., 2009;Diskin et al., 2009;McCarroll et al., 2008), they are vastly fewer than reported single-nucleotide polymorphism (SNP) associations. In a landmark study, the Wellcome Trust Case Control Consortium identified the majority of common CNV with allele frequency >5% and genotyped this variation in seven * To whom correspondence should be addressed. complex disease cohorts and two control cohorts using microarray-based comparative genomic hybridization (aCGH). However, analysis of these data revealed only three previously identified copy number associations (Craddock et al., 2010). This study also comprehensively identified many sources of bias which continue to plague CNV association studies, including batch effects, systematic differences between DNA extracted from cell lines versus whole blood, homology, particularly to the MHC, as well as difficulty in accurately determining CNV genotypes. The authors of this study concluded that the subset of CNVs that are genotypeable on CGH arrays-mostly simple non-recurrent deletions-do not explain a large proportion of phenotypic variation. However, there remains the possibility of extensive disease association with more complex CNV, as well as common CNV with small to moderate effect sizes.
In contrast to most complex disease, the role of common CNV in the pathogenesis of psoriasis has been robustly demonstrated and replicated across multiple populations worldwide (de Cid et al., 2009;Hollox et al., 2008;Huffmeier et al., 2010;Li et al., 2011;Riveira-Munoz et al., 2011;Wiwanitkit, 2010;Xu et al., 2011). In particular, association between psoriasis and a deletion spanning the 'late cornified envelope' genes LCE3B and LCE3C was originally shown in European populations (de Cid et al., 2009) and was subsequently replicated in a Chinese cohort (Xu et al., 2011). This CNV is a simple 32.2-kb deletion, with allele frequency of ∼60%, for which both heterozygote and homozygote deletions are found in the population. The LCE3B_LCE3C deletion has an odds ratio of ∼1.38 for psoriasis.
Exome capture sequencing is now widely being applied to both Mendelian and complex disease cohorts (Ng et al., 2010), primarily to find rare associations with disease which have not been identified by extensive GWAS (O'Roak et al., 2011), as well as to refine association signals already reported (Bonnefond et al., 2012;Palmer et al., 2011;Rivas et al., 2011). Exome sequencing uses hybridization technology to first capture the majority of exons, and then subsequently applies massively parallel sequencing to these captured fragments. As the sequencing step is a stochastic process, the number of times a sequence is read should be reflective of its relative copy number in the original sample. However, because of the hybridization step, many of the biases present in array-based studies, such as batch effects and, effects from obtaining DNA from different sources are still present in these data. Nevertheless, the utility of exome sequence data to examine CNV in matched tumour/healthy samples has already been demonstrated (Sathirapongsasuti et al., 2011).
Exome capture may also be a promising approach for finding common copy number association with disease. However, the ability to accurately call absolute copy number genotypes from these data has yet to be demonstrated. Moreover, there has not yet been any demonstration of the ability to re-discover known common CNV associations from these data. In this article, we develop an approach to find common CNV associated with complex disease. We apply this approach to exome sequence data obtained on a psoriasis case-control cohort. We demonstrate that we are able to overcome inflation of test-statistics while identifying the known LCE3B_LCE3C deletion association with P-value of 5 × 10e−6 in a case-control cohort of 1500 samples. Moreover, we show that we can use exome sequence data to generate highly accurate absolute CNV genotypes at this locus.

Exon capture and sequencing
Genomic DNA of each individual was hybridized with NimbleGen 2.1 Mprobe sequence capture array to enrich the exonic DNA in each library. The array captures 18 654 of the 20 091 genes deposited in Consensus Coding Sequence Region database.
DNA was randomly fragmented by Covaris to an average size of 200-300 bp, and adaptors were ligated to both ends of the resulting fragments. The adaptor-ligated DNA products were amplified with different index primers. The same amount of PCR product from pairs of samples was pooled then hybridized to each capture array following NimbleGen's protocol, after which the exome-enriched DNA fragments were eluted from the array and amplified by ligation-mediated PCR, and non-hybridized fragments were then washed out.
Sequencing was carried out on the Illumina HiSeq 2000 platform for each captured library independently to ensure each sample had at least ∼15-fold coverage. Raw image files were processed by Illumina Pipeline (version 1.3.4) for base-calling with default parameters and the sequences of each individual were generated as 90-bp reads.
SOAPaligner (v2.21) (Li et al., 2008) was used to align the sequencing reads to the NCBI human genome reference assembly (build 36.3) with a maximum of three mismatches and the parameters set as '-a -b -D -o -u -r 1 -s 40 -l 35'.

Read depth normalization
Within each capture region i, and for each individual k = 1 … N, we record the total aligned read depth divided by the average read depth for that sample, which we label r ijk , for each base pair j = 1 … M. We then calculate the first 'local' principal component FPC i,k for each matrix R i = {r ijk }, using an iterative scaling algorithm outlined below. This algorithm has a very low memory footprint and is highly computationally efficient. We then calculate the first 50 'global' principal components of the matrix comprised of all of the local first principal components {FPC i,k }, which we label GPC 1 … GPC 50 . This calculation is carried out using the singular value decomposition, which is not as fast or memory efficient as the iterative scaling algorithm, but is more accurate for calculating lower order PCs. These global PCs are considered to capture biases which might lead to artefactual CNV association. We then calculate modified local PCs by projecting out the first H of these global PCs, for fixed H : (2) is the projection of the FPC i onto the h th global PC, using <> to denote the inner product in a N dimensional vector space.

Iterative scaling algorithm to calculate FPC
We have implemented an algorithm described in (Roweis, 1998) for computationally efficient calculation of the first principal component (FPC). We first centralize the matrix R i so that each column has zero mean. We first FPC i to be a vector of length N with entries randomly chosen from the interval (0,1). We then iteratively update FPC i via the following equations: until convergence of FPC i . This algorithm avoids the computationally expensive calculation of the covariance matrix R * R T . Moreover, it has a very low memory footprint as the calculation in Equation (3) can be calculated via streaming one column of the depth matrix at a time. This algorithm accurately calculates the FPC but becomes progressively less accurate for lower order PCs.

Association with disease status
For a fixed level of adjustment, H , we then carry out association of FPC (H) i over all regions i, with the case-control phenotype. The association is modelled with logistic regression, with P-values calculated using the GLM package in R. We plot quantile-quantile plots for different values of H and also calculate the genomic control inflation factor λ in order to inspect for inflation of test statistics.

Calculating absolute copy number genotypes
For a given putatively association region i, and for each sample, k, we calculate {ra ijk } as the average depth in non-overlapping 100-bp windows j divided by the genome-wide average depth. We then project out the first H global PCs to calculate ra (H) ij. = ra ij. − h=1.H proj(ra ij. , GPC h ) We then apply the population-haplotype clustering model of cnvHap (Coin et al., 2010) to these data in order to jointly segment each sample into different CNV states. This is achieved by using a different emission distribution per hidden copy number genotype CN = 0 … 4 per 100-bp window to model the normalized read depth. Spatial modelling of the CNV is achieved via a HMM as previously described. CNVhap returns the CNV genotypes for each sample, as well as an estimate of the certainty of these genotypes.

RESULTS
We applied ExoCNVTest to exome sequence data collected on 800 controls and 700 psoriasis cases. Association of the uncorrected FPC was highly inflated with inflation factor (λ) of 8.5 ( Fig. 1 and Table 1). In this analysis the known association LCE3B_LCE3C was ranked 3750 out of 105 088 regions. Correction for 5 and 20 GPCs progressively reduced inflation of test statistics and also increased the rank of LCE3B_LCE3C. However, correction for 40 GPCs was required to entirely remove inflation of test statistics and reduce λ< 1. In this analysis the LCE3B_LCE3C cluster was ranked 25th, with P-value of 7.2e−5.
In order to investigate whether our approach was accurately capturing the underlying CNV present in LCE3B_LCE3C, we  investigated the correlation between the adjusted FPC in these capture regions and gold standard copy number genotypes obtained from PCR in these samples. We observed increasing correlation between the adjusted FPC and the gold-standard PCR genotypes up to 20 GPC, but not beyond (Table 2). This demonstrates that adjusting for GPCs has two effects-the first being to reduce spurious signals and the second being to increase the correlation between real and detected signals at genuine associated loci, although the optimal number of GPC correction may be different for these two goals. We also observed substantial GC biases in uncorrected association statistics, with an over-representation of significantly associated signals at ∼33% and 60% GC content ( Fig. 2A). Correcting for 40 GPCs removed this bias (Fig. 2B). Finally, we attempted to obtain absolute copy number genotypes at the LCE3B_LCE3C loci. We did this by averaging the persample read depth in 100-bp windows divided by average genome wide read depth and then projecting out the first 40 global PCs.  These normalized data were presented to cnvHap, with initial copy number cluster positions at 0, 0.5, 1, 1.5, 2 normalized read depth for 0, 1, 2, 3, 4 copy numbers, respectively. cnvHap updates the positions of these cluster positions within a generalized expectation maximization framework and then reports the per-sample copy number genotypes (Fig. 3). The reported CNV genotypes were found to be 97.4% concordant with gold standard CNV genotypes, with a missing rate of 14.7% (Table 3). These CNV genotypes had stronger association with case-control status (P-value of 5e−6) than FPC, demonstrating the value of using these genotypes to refine the association signal.

DISCUSSION
While approaches for finding common SNP association with disease have become routine, finding common CNV associated with complex disease has proven more elusive. Accurately genotyping CNVs in large case-control cohorts using data from SNP genotyping arrays remains very challenging, and attempts to associate CNV genotypes defined in this way with disease have often led to highly inflated test statistics, as well as results that are difficult to independently replicate. Array CGH was used to discover the LCE3B_LCE3C association, and studies using array CGH tend to be more robust. However, aCGH studies are still subject to substantial batch effects, and aCGH has not been widely applied to very large multi-centre case-control cohorts needed to detect small to modest effect sizes. Now that exome sequencing is being routinely applied to case-control cohorts, there is a new opportunity to look for common CNV associations with disease. This article provides the first 'proof-of-principle' that exome sequencing can be used in this way, by developing an algorithm which adequately controls for batch effects, and applying this algorithm to re-discover one of the few confirmed common CNV associations with complex disease.
We have also shown that as well as detecting CNV association with disease, it is possible to calculate highly accurate absolute copy number genotypes at associated loci. To do this, we have applied cnvHap to perform joint CNV segmentation and genotyping. cnvHap has been shown to be one of the most sensitive CNV discovery and genotyping algorithms, due to its flexible modelling of intensity data (in this case read-depth data) across an entire population.
Application of our method yielded an association P-value between LCE3B_LCE3C_del and psoriasis of 7.2e−5 using FPC corrected for 40 GPCs, and 5e−6 using estimated absolute CNV genotypes. While these P-values are not as significant as the initial reported P-value of 1.38e−8 (de Cid et al., 2009), this initial study was based on 2831 samples while the data presented here are derived from 1500 samples. A UK replication study with 1962 samples obtained an association P-value of 0.03, while a larger German replication study with 2278 samples obtained an association P-value of 3.3e−5 (Riveira-Munoz et al., 2011). Principal components are widely used to adjust for the inflation of test statistics due to population structure in genome-wide association studies; however, these studies typically only adjust for up to 10 PCs, with a greater number of PCs required for highly structured populations. The fact that 40 PCs were required to correct for readdepth biases likely reflects the fact that the 'batch' size consisted of two samples per capture array.
We might speculate that CNV of genes, and more specifically exons, may generate substantial phenotypic variation. Our ability to assess this effect to date has been limited by the technical challenge of array technologies. Exome sequencing, however, as it is measuring precisely this type of variation in the genome, provides an exciting opportunity to finally assess the association between gene copy number and phenotype.