Detection and correction of artefacts in estimation of rare copy number variants and analysis of rare deletions in type 1 diabetes

Copy number variants (CNVs) have been proposed as a possible source of ‘missing heritability’ in complex human diseases. Two studies of type 1 diabetes (T1D) found null associations with common copy number polymorphisms, but CNVs of low frequency and high penetrance could still play a role. We used the Log-R-ratio intensity data from a dense single nucleotide polymorphism (SNP) array, ImmunoChip, to detect rare CNV deletions (rDELs) and duplications (rDUPs) in 6808 T1D cases, 9954 controls and 2206 families with T1D-affected offspring. Initial analyses detected CNV associations. However, these were shown to be false-positive findings, failing replication with polymerase chain reaction. We developed a pipeline of quality control (QC) tests that were calibrated using systematic testing of sensitivity and specificity. The case–control odds ratios (OR) of CNV burden on T1D risk resulting from this QC pipeline converged on unity, suggesting no global frequency difference in rDELs or rDUPs. There was evidence that deletions could impact T1D risk for a small minority of cases, with enrichment for rDELs longer than 400 kb (OR = 1.57, P = 0.005). There were also 18 de novo rDELs detected in affected offspring but none for unaffected siblings (P = 0.03). No specific CNV regions showed robust evidence for association with T1D, although frequencies were lower than expected (most less than 0.1%), substantially reducing statistical power, which was examined in detail. We present an R-package, plumbCNV, which provides an automated approach for QC and detection of rare CNVs that can facilitate equivalent analyses of large-scale SNP array datasets.


Quality control approaches applied in other CNV studies
There seems to be no consensus on which quality control procedures work best, and the best approach may vary between datasets and chips. The following are some representative approaches used by other groups. The first (S1) used the intersection of three CNV calling algorithms: QuantiSNP, PennCNV and GNOSIS. For each, they varied the QC applied. For PennCNV they used the built-in LogR standard deviation, BAF drift and waviness factor cutoffs. The waviness factor is a measure of wave artifact (predominantly GC wave), which is detected and corrected automatically as part of the PennCNV calling software. For QuantiSNP they used a percentage threshold for outlier probe intensities, and filtered for extreme BAF and LogR standard deviation. For GNOSIS they required a quality score greater than 10. For all methods they excluded samples with a large number of large negative-valued probes, and applied a SNP call rate cutoff of 98.5%. This approach yielded a positive predictive value of 91% with PCR validation in their autism dataset. The online PennCNV documentation recommends several quality control procedures, including those mentioned above, alongside a suggestion to remove samples or plates with large numbers of detected CNVs. The Broad Institute developed a suite of microarray analysis tools called birdsuite (S2) that incorporates a CNV calling procedure for rare CNVs (birdseye) and common CNPs (Canary). The reported processing time is orders of magnitude slower than plumbCNV, with estimates of 2-5 hours per 96 samples. The software currently supports only Affymetrix SNP5/SNP6 arrays. Processing requires data to be pre-prepared and treated one plate at a time, and the algorithm uses plate- and chromosome-level intensity distributions to normalize the data within each plate.
Extreme values on noise measures are used to flag samples for exclusion, alongside a quality score for each CNV, LOD (likelihood of duplication/deletion), which quantifies the likelihood that probes are part of a CNV rather than sharing the copy state of the surrounding probes.
In a study designed to examine the sensitivity of a microarray-based calling method to reproduce CNVs detected by sequencing, birdsuite significantly outperformed two other CNV calling algorithms, with sensitivities of 32.8%, 61.6%, 64.8% and 93.8% for CNVs tagged by <5, 6-10, 11-20 and 20+ probes, respectively. Clearly, CNVs tagged by a greater number of probes are more easily detected. The other methods (Nexus, Partek) showed lower sensitivities in this test: 5.5%, 42.0%, 69.0% and 74.4% for Nexus with 'relaxed' settings, and 0.4%, 13.4%, 47.9% and 72.9% for Partek [S2]. Most LRR-QC approaches use filters based on extreme intensity values; many apply SNP quality control prior to calling (e.g., call rates) and filter by some sort of quality score. Birdsuite appears to perform well on validation metrics, but the process is slow and cumbersome (though comprehensive) and limited to Affymetrix arrays. The Sanders method, implemented as a pipeline called CNVision, was shown to perform well in the context of several autism case–control studies. However, for application to the current dataset, the method in [S1] has no way to deal with systematic intensity biases between cohorts collected and typed separately.

Validation using common samples with a MetaboChip dataset
To test the reliability of CNV calls from this pipeline, CNV detection and QC were run on a set of 5,458 samples that had also been typed on another Illumina custom iSelect chip, MetaboChip, which has 5.9% of SNPs in common with ImmunoChip (n=11,622). CNVs detected by MetaboChip were filtered to select only those covered by at least five ImmunoChip SNP positions (regardless of whether they were the same SNPs). The ImmunoChip data at these locations were then tested for the presence of the same CNV in the same sample. Sensitivity for rDELs was initially low, but inspection of the data showed this was due to a specific artifact in the MetaboChip data, caused by a high rate of monomorphic SNPs. This artifact was clearly identifiable on visual inspection but had not been picked up by any QC, which prompted the creation of a CNV quality score.
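The coverage filter described above can be sketched as a simple interval check. This is an illustrative sketch in Python, not plumbCNV code; the function names and coordinates are hypothetical.

```python
# Illustrative sketch (not plumbCNV code): keep only MetaboChip CNV calls
# covered by at least five ImmunoChip probe positions, so the same region
# could in principle be re-called on the second chip.

def count_probes_in_cnv(cnv, probe_positions):
    """Count probe positions falling inside a CNV given as (chrom, start, end)."""
    chrom, start, end = cnv
    return sum(start <= pos <= end for pos in probe_positions.get(chrom, []))

def filter_cnvs_by_coverage(cnvs, probe_positions, min_probes=5):
    return [c for c in cnvs if count_probes_in_cnv(c, probe_positions) >= min_probes]

# Toy example with hypothetical coordinates:
immunochip = {"chr1": [100, 150, 200, 250, 300, 900]}
cnvs = [("chr1", 90, 310),   # covered by 5 probes -> kept
        ("chr1", 850, 950)]  # covered by 1 probe  -> dropped
kept = filter_cnvs_by_coverage(cnvs, immunochip)
```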

CNV Quality Score
The score comprised two parts: (i) a BAF quality score based on the emission matrix for the PennCNV HMM, namely the likelihood that the CNV SNP intensities were in a CNV state and that the surrounding SNP intensities reflected a normal state; (ii) an LRR quality score based on the percentage of CNV SNP intensities on the expected side of zero, versus the percentage of surrounding SNP intensities on and around zero.
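As an illustration of part (ii), a minimal sketch of an LRR-based quality component, assuming a deletion pushes LRR below zero. The combination rule and the 'near zero' tolerance here are assumptions for illustration, not the exact plumbCNV formula.

```python
# Hedged sketch of the LRR component of the quality score: the fraction of
# in-CNV LRR values on the expected side of zero (negative for deletions),
# combined with the fraction of flanking LRR values near zero. The zero
# tolerance and the multiplicative combination are illustrative assumptions.

def lrr_quality(cnv_lrr, flank_lrr, deletion=True, zero_tol=0.1):
    if deletion:
        in_state = sum(x < 0 for x in cnv_lrr) / len(cnv_lrr)
    else:
        in_state = sum(x > 0 for x in cnv_lrr) / len(cnv_lrr)
    flank_normal = sum(abs(x) <= zero_tol for x in flank_lrr) / len(flank_lrr)
    return in_state * flank_normal

# 3/4 of CNV probes below zero, 3/4 of flanking probes near zero:
score = lrr_quality([-0.6, -0.5, -0.7, 0.05], [0.02, -0.03, 0.01, 0.4])
```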
This quality score proved sensitive to the problematic artifact in the MetaboChip CNVs. Revised results filtered by quality score showed a maximum sensitivity of 87.7% and specificity of 93.5% at a cutoff of 0.95 (quality scores range up to 1.0), with an area under the curve of 0.932. For rDUPs, 100% of MetaboChip rDUPs were found on ImmunoChip, with 88.2% specificity, regardless of whether the quality threshold was applied. Note that the quality scores have only been shown to correct the artifact that was causing problems in the MetaboChip rDEL set, not in ImmunoChip rDELs. Furthermore, only 95 rDELs and 54 rDUPs were detected in the shared samples, so it is possible that this correspondence between chips is not representative. See Supplementary Figure 5 for a plot of quality score against number of SNPs per CNV (more tagging SNPs is generally thought to increase quality).
The score is implemented in the plumbCNV function 'process.quality.scores()', which can be run on any existing plumbCNV session, but it is also run automatically during normal operation, so quality score tables should be available in the plumbCNV output folder 'RESULTS'. We found that quality score thresholds of 0.95 for rDELs and 0.7 for rDUPs were optimal for our dataset. Another extremely useful criterion was to exclude an entire CNVR if a large proportion of CNVs in that CNVR were excluded for having a low quality score. This threshold is somewhat arbitrary, but for higher confidence we recommend a stricter value. For instance, we have widely used 80% as a threshold (80% of CNVs in a CNVR must pass the quality score threshold), which we observed was effective at identifying CNVs known to be spurious that had passed other QC criteria.
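The 80% CNVR-level rule can be sketched as follows; the data layout (a mapping from CNVR identifier to per-CNV quality scores) is hypothetical, not plumbCNV's internal representation.

```python
# Sketch of the CNVR-level filter described above: drop an entire CNVR if
# fewer than 80% of its constituent CNV calls pass the per-CNV quality
# threshold (0.95 for rDELs in our dataset).

def filter_cnvrs(cnvr_scores, q_threshold=0.95, pass_fraction=0.80):
    kept = {}
    for cnvr, scores in cnvr_scores.items():
        frac = sum(s >= q_threshold for s in scores) / len(scores)
        if frac >= pass_fraction:
            kept[cnvr] = scores
    return kept

cnvrs = {"CNVR_A": [0.99, 0.97, 0.96, 0.98, 0.99],  # 100% pass -> kept
         "CNVR_B": [0.99, 0.40, 0.35, 0.96, 0.50]}  # 40% pass  -> dropped
kept = filter_cnvrs(cnvrs)
```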

Systematic correlations of gender and age with LRR intensity
A previous study (43) using the same T1D family dataset as the present study identified a number of technical artifacts, including an age effect on intensity at chr14q11.2, chr7q34 and chr14q32.33. The current dataset was tested using genome-wide correlations for the same effect, but only three locations passed a Bonferroni-adjusted p-value for an age effect, and these did not show the same smooth effect when plotted against age and occurred in unconnected regions. Targeted plotting of SNPs tagging the same regions did reveal a trend at some sites similar in shape to those in Figure 3 of (43); however, these did not reach statistical significance at an adjusted p-value, and the majority of locations in these three regions showed inconsistent shapes. We therefore concluded that the location-specific age-intensity confounds seen previously (43) were not convincingly observed in the current dataset.
Gender differences in intensity are a commonly observed phenomenon, particularly in the context of genotype calling, where males and females can have slightly offset cluster centres. Genome-wide associations between LRR intensity and gender were significant at an adjusted p-value at 32.1% of all SNP locations prior to PC correction. After 24-PC correction, 8.2% of sites still showed differences. plumbCNV also has an option to correct for gender differences directly, which, as expected, left no differences.
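One simple form of such a correction is per-SNP mean-centring within each gender, sketched below; plumbCNV's actual gender-correction option may differ in implementation detail.

```python
# Illustrative sketch of per-SNP gender correction of LRR: subtract the
# gender-specific mean at each SNP so male and female intensity
# distributions are centred identically.

def gender_centre(lrr_by_sample, gender_by_sample):
    """lrr_by_sample: {sample: [LRR per SNP]}; gender_by_sample: {sample: 'M'/'F'}."""
    n_snps = len(next(iter(lrr_by_sample.values())))
    means = {}
    for g in ("M", "F"):
        members = [s for s in lrr_by_sample if gender_by_sample[s] == g]
        means[g] = [sum(lrr_by_sample[s][j] for s in members) / len(members)
                    for j in range(n_snps)]
    return {s: [lrr_by_sample[s][j] - means[gender_by_sample[s]][j]
                for j in range(n_snps)]
            for s in lrr_by_sample}

lrr = {"s1": [0.1, 0.2], "s2": [0.3, 0.0], "s3": [-0.1, 0.1], "s4": [0.1, -0.1]}
sex = {"s1": "M", "s2": "M", "s3": "F", "s4": "F"}
corrected = gender_centre(lrr, sex)
```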

PCR validation of biases
The most notable plate-specific artefacts (Methods) were detected for three gene regions: (1) within the HOXA1-HOXA13 set of paralogous genes at chromosome 7p15.2, for a 46.5 kb CNVR with 30:2 case:control rDEL calls; (2) 20 kb downstream of the common HLA CNVR at 6p21.33, for a 1.2 kb CNVR that included 23:1 case:control rDEL calls at HCG9; and (3) on 12q14.1 at AGAP2, for an 11 kb CNVR with 9:0 case:control rDEL calls. One of the 15 excluded plates, which contained the highest number of rDEL carriers (n=37), had a sample-based LRR mean of 0.02 and DLRS of 0.09, well within the normal limits detected across all other plates (see Figure 9 for plate-based LRR and DLRS distributions). Thus, the artefactual CNV calls on this plate were not identifiable through the array-wide LRR and DLRS metrics.
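For reference, a sketch of the two per-sample noise metrics used here. DLRS (derivative log-ratio spread) is computed in its usual form, the standard deviation of adjacent-probe differences divided by the square root of two; plumbCNV's exact implementation may differ in detail.

```python
import math
import statistics

def lrr_mean(lrr):
    """Per-sample mean LRR across probes."""
    return statistics.fmean(lrr)

def dlrs(lrr):
    """Derivative log-ratio spread: SD of adjacent-probe differences / sqrt(2)."""
    diffs = [b - a for a, b in zip(lrr, lrr[1:])]
    return statistics.stdev(diffs) / math.sqrt(2)

sample = [0.0, 0.1, -0.1, 0.0]
m, d = lrr_mean(sample), dlrs(sample)
```

As the excluded plate illustrates, samples can show an unremarkable LRR mean and DLRS yet still carry plate-level artefacts, which is why per-plate tests are needed in addition to these per-sample metrics.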
Given the repetitive architecture of the 13 HOXA genes, we tested the HOXA1-HOXA13 region for duplicate mappings of ImmunoChip assays and tested the identity between the paralogous genes. No evidence of cross-hybridisation was detected for any of the probes, and none of the overlapping genes contained sequence duplication. The HCG9 rDEL calls, in the immediate vicinity (20 kb) of a common CNVR in the HLA region, may reflect a boundary-specific spurious signal or, as described in other recent studies, a 'peninsula' effect of a common CNVR generated by variation in boundary truncation during CNV calling. None of the HOXA1-HOXA13 rDELs were successfully validated with qPCR, using ABI Hs02286016_cn, Hs00252571_cn, Hs03653822_cn and Hs03623434_cn assays. Since no qPCR assays were available for the HCG9 rDEL, conventional PCR and agarose gel analysis were used with two sets of primers: A_F accgcgtaccaaacaataatataga, A_R cttcctattgctgtggggactat, and B_F aattcccaataaccaaattcatagag, B_R ctaactttctggtgctgccttt. We could not validate the HCG9 rDEL.
For the third plate-specific CNVR, we were able to validate the nine AGAP2 rDEL calls for the T1D cases using ABI qPCR assays Hs07012691_cn and Hs02280355_cn. However, since we did not observe any rDEL calls in controls, and cell-line production and propagation can be associated with genomic mutations and CNVs, we analysed an additional 96 cases and 96 controls using blood (genomic) DNA to confirm the frequency of the AGAP2 rDEL. qPCR assay of these blood DNA samples showed high variability in copy number, with the largest variation caused by DNA extraction method (chloroform versus Qiagen kit). In addition, we found high variability in CNVR copy number within each of the DNA preparation methods. This difference was present across multiple probes in the AGAP2 CNVR (Supplementary Figure 16). The probe sets, although validated by the manufacturer, did not perform reproducibly in DNA samples isolated by different methods. It is therefore possible that impurities in DNA samples contribute to unpredictable changes and discrepancies in this region when analysed by array or PCR assays, as we have proposed previously (S3). Given the plate-specific artefacts, all samples from the outlier plates were removed, which removed the enrichment initially detected for these three CNVRs.

List of PCR replication loci
The following loci were subjected to PCR validation during development of the CNV-QC pipeline. Those that failed to replicate were generally CNVs that showed a case–control association before the QC method was refined to deal with a new type of artifact: ADAMTS7, AGAP2, AIRE, BCAR1, C20orf94, CABP7, CTRB2, DEAF1, DGKQ, FOXP3, HCG9, HOXA3, IFIH1, INS, PTPN2, SLC26A1. *Note that DAP3 was only replicated in African Americans. KIR3DL2 CNVs were successfully detected in this analysis, but were excluded because their frequency was above the 1% threshold for rare CNVs.

Successful qPCR validation of two CNVRs
During QC of CNVRs, it became apparent that, quite often, the CNVRs with the most significant case–control p-values were the most affected by artifact. After refining the QC to a level that appeared to produce valid results, we tested some of the top hits using qPCR. Although these case–control differences were not confirmed after Bonferroni correction, they are still probably the best targets to demonstrate that the pipeline can produce valid CNV calls that can differ between cases and controls.
rDELs at PKIA, a gene located on chromosome 8q21.12, had nominal evidence for association with T1D, with six case rDELs and one control rDEL of median size 23 kb (OR = 8.59, P(FET) = 0.022). PKIA is a member of the cAMP-dependent protein kinase (PKA) inhibitor family and is a potent competitive inhibitor of cAMP-dependent protein kinase activity. All PKIA rDELs were confirmed with quantitative polymerase chain reaction (qPCR) (Hs06173614_cn, Hs06242849_cn and Hs05059220_cn, Supplementary Table 7).
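For a carrier-count comparison like PKIA's six case versus one control rDELs, Fisher's exact test reduces to a hypergeometric tail sum. The sketch below is one-sided and uses the pre-QC cohort sizes (6808 cases, 9954 controls) as illustrative denominators, so it is not expected to reproduce the reported two-sided P = 0.022 exactly.

```python
from math import comb

# One-sided Fisher's exact test on a 2x2 carrier table, via the
# hypergeometric distribution. a/b = case carriers/non-carriers,
# c/d = control carriers/non-carriers. Denominators are illustrative.

def fisher_one_sided(a, b, c, d):
    """P(carriers among cases >= a) under the hypergeometric null."""
    n1, k, N = a + b, a + c, a + b + c + d
    return sum(comb(n1, x) * comb(N - n1, k - x)
               for x in range(a, min(n1, k) + 1)) / comb(N, k)

p = fisher_one_sided(6, 6808 - 6, 1, 9954 - 1)  # 6 case vs 1 control carriers
```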
Another large CNVR, ~500 kb on chromosome 16p11.2 containing SH2B1, has been widely reported in the literature as contributing to the genetic risk of obesity (11-15). All SH2B1 CNVs tested (rDELs and rDUPs) were confirmed by qPCR (Hs02927996_cn and Hs01749891_cn, Supplementary Table 7). To explore whether there was any evidence for an effect of copy number dosage for SH2B1 rDUPs, we also compared rDEL frequencies across the case–control and family datasets (Supplementary Methods Table 3); each of these was replicated using PCR. Testing the overall affected versus unaffected ratio in the family data using FET is not recommended, due to irreconcilable confounds of family structure. However, for this rDEL there was only one trio in which a parent transmitted to an affected child (two were de novos and the remainder were found only in parents), so the TDT analysis had no chance to capture an effect. Combining the results for the family and case–control datasets gave 13 rDELs in T1D samples versus three in controls.

Assessment of Sensitivity and Specificity when varying HMM parameters
Tests were run on the family dataset to calibrate the sensitivity of the duplication and deletion HMM parameters, by examining de novo and transmission rates. Testing suggested that the expected value for detecting copy-three DUPs should be lowered slightly from the default, which may better accommodate the intrinsically smaller LRR change for duplications. Using a copy-three rDUP expected value of 0.2 instead of 0.4 increased sensitivity by ~15% whilst maintaining specificity. The expected value for detecting copy-one DELs was changed from -0.65 to -0.55, based on small improvements in de novo and transmission differences with and without trios, alongside a 4% improvement in specificity and more than 15% improvement in sensitivity. A stronger threshold was avoided due to the increased de novo rate. Note that the de novo rates with trio calling are higher than reported in the main paper (e.g., ~6% versus ~1%); this is because these estimates include CNVs with population frequency between 1% and 5%, and because exclusions from manual checking of intensity plots were not applied. Each pair of rows of Supplementary Methods Table 4 compares CNV calling performance without, then with, trio-calling. Trio calling generally results in fewer CNVs: whilst some extra CNVs are identified, more are discarded because they are not present in other family members and the confidence of the call is too low to claim a de novo. Transmission is the percentage of CNVs in parents passed to all children.

rDELs
The de novo estimate is the number of CNVs found in children but not in either of their parents. The CNV count is the final number of rDELs or rDUPs in that particular test run. The specificity percentage is the rate of CNVs in the set that does not use trio-calling versus the set that does. The internal specificity is the percentage of trio-called CNVs (validated within a family member) that were present in the original set without trio-calling. The max-set specificity uses the run with the highest sensitivity, rather than the current run, to set the denominator for the sensitivity percentage. A copy-1 (rDEL) mean of -0.55 was judged to be optimal based on specificity and sensitivity comparable to the mean = 0.4 run, but with lower de novo rates. A copy-3 (rDUP) mean of 0.2 was judged to be optimal based on maximising sensitivity and specificity whilst showing de novo and transmission rates comparable to the runs using more conservative thresholds. Note that specificity remains stable alongside increases in sensitivity; this is due to the moderating influence of quality scores, which filter out many false-positive hits. * = default PennCNV parameters; HMM model means: copy 1 ~ -0.65, copy 3 ~ 0.4.
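The bookkeeping behind these quantities can be sketched as below; the per-family record layout (one carrier-flag record per family per CNV locus) is hypothetical, not the format used by the pipeline.

```python
# Sketch of family-based rate bookkeeping: transmission rate (fraction of
# parental CNVs also called in the child) and de novo rate (fraction of
# child CNVs absent from both parents).

def family_rates(calls):
    """calls: list of dicts {'father': bool, 'mother': bool, 'child': bool}."""
    parental = [c for c in calls if c["father"] or c["mother"]]
    transmitted = sum(c["child"] for c in parental)
    child_calls = sum(c["child"] for c in calls)
    de_novo = sum(c["child"] and not (c["father"] or c["mother"]) for c in calls)
    return {"transmission": transmitted / len(parental),
            "de_novo_rate": de_novo / child_calls}

toy = [
    {"father": True,  "mother": False, "child": True},
    {"father": False, "mother": True,  "child": False},
    {"father": True,  "mother": False, "child": True},
    {"father": False, "mother": True,  "child": False},
    {"father": False, "mother": False, "child": True},  # apparent de novo
]
rates = family_rates(toy)
```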

Estimating CNV detection accuracy via de novo and transmission rates
An estimate of the overall sensitivity of the CNV detection pipeline, incorporating the limits of the technology, the datasets and the impact of the QC, can be made by dividing the transmission rate in familial samples (for which calling was not informed by family structure) by the expected transmission rate of 50%. Trio calling using family data uses extra prior information (family structure), in addition to the LRR and BAF signals, to call CNVs. For instance, if a CNV is found in a child but not a parent, or vice versa, that region may be re-evaluated within that family using a more relaxed detection threshold. Trio calling should therefore yield transmission rates closer to the expected 50%, with a correspondingly lower frequency of de novo CNVs. This estimate (derived by doubling the overall percentages in the top two rows of Supplementary Methods Table 5) was 79.8% for rare CNV deletions (rDELs) and 74.6% for rare CNV duplications (rDUPs). The estimate represents a lower bound, because some of the CNVs not transmitted are likely to be false positives. An upper bound was estimated as the proportion of the final trio-called sets of rDELs and rDUPs that were present in the non-trio sets, yielding 87.6% for rDELs and 87.9% for rDUPs. An upper estimate of the specificity was made equivalently using the change in de novo rate with trio calling, since the 'de novos' that are excluded once family data are better accounted for can confidently be called false positives. Furthermore, for a more direct comparison with the larger case–control dataset, this transmission ratio was alternatively calculated without using the family structural information for CNV calling. The rates for rDELs were still equal, giving a ratio of 1.0 (39.9% for both affected and unaffected), showing that there was no bias regardless of whether the greater accuracy afforded by joint calling was available.
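The lower-bound sensitivity estimate is a one-line calculation; the observed transmission rates used below (39.9% for rDELs, 37.3% for rDUPs) are back-derived from the reported 79.8% and 74.6% estimates by halving.

```python
# Lower-bound sensitivity: observed transmission rate divided by the
# Mendelian expectation of 50%. Observed rates are back-derived from the
# reported estimates for illustration.

def sensitivity_lower_bound(observed_transmission, expected=0.5):
    return observed_transmission / expected

rdel = sensitivity_lower_bound(0.399)  # rDELs: 0.399 / 0.5 = 0.798 (79.8%)
rdup = sensitivity_lower_bound(0.373)  # rDUPs: 0.373 / 0.5 = 0.746 (74.6%)
```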

Control versus control testing
For overall CNV counts versus samples passing QC, the ratio between the two control cohorts was 0.99 (Sanger: 573/4213, UVA: 691/5025) for rDELs and 1.05 (Sanger: 1151/4213, UVA: 1306/5025) for rDUPs. These cohorts were genotyped at different times in different centres and had different proportions of cell-line versus genomic DNA samples, which were major potential sources of bias that appear to have been overcome.
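The cohort comparison is straightforward arithmetic on per-sample call rates, using the counts given above:

```python
# Ratio of per-sample CNV rates between the two control cohorts.

def cohort_ratio(count_a, n_a, count_b, n_b):
    return (count_a / n_a) / (count_b / n_b)

rdel_ratio = cohort_ratio(573, 4213, 691, 5025)    # Sanger vs UVA rDELs, ~0.99
rdup_ratio = cohort_ratio(1151, 4213, 1306, 5025)  # Sanger vs UVA rDUPs, ~1.05
```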
To further demonstrate the lack of bias for burden tests, a comparison across CNV lengths was made between the UVA and Sanger control cohorts. This showed that, despite batch effects, there was still a stable ~1.0 ratio between control groups (Supplementary Methods

Examining the distribution of CNV lengths between cases and controls
The length distribution of CNVs was examined in cases versus controls; differences in length distribution could indicate either systematic differences in array sensitivity or a widespread burden effect. Conventional GWAS arrays detect CNVs with a high degree of confidence only at the minimum 20 kb threshold. In contrast, the ImmunoChip allowed reliable identification of CNVs much smaller than 20 kb; such CNVs were detected in 62% of rDEL carriers. Overall, we did not find a significant difference in rDEL or rDUP length between case and control carriers (P > 0.05), with nearly identical medians and first and third quartiles. For rDELs, a median size of ~13 kb was estimated in both cases and controls, ranging between 0.6 kb and 23.5 Mb, with first and third quartiles of 6.7 kb and 33.5 kb. rDUPs were substantially longer, with a median size of ~86 kb in both cases and controls, ranging between 0.6 kb and 9.7 Mb, with first and third quartiles of 28.1 kb and 196.7 kb.
The largest rDEL in the case–control dataset was 23 Mb, located on chromosome 6 (position 97.55-121.05 Mb). This result could have been biased by the chromosome-aberration QC, which excludes samples with high or low intensity across an entire chromosome. The algorithm is designed to automatically exclude samples with aneuploidy, but could also be triggered by very large CNVs. Feasibly, a higher rate of sample exclusion for controls with very large CNVs could have influenced this burden result. However, the opposite was true: the relative frequencies of outlier samples excluded for extreme chromosome mean intensities were consistent with the burden result, whereby T1D cases had a higher frequency of chromosome-wide intensity differences than controls (OR = 1.51, FET(175, 183), p-value = 1.4 × 10^-4).

Figure 6: Exemplar data for GC wave artifact in a single sample. An example of GC wave, showing a close look at the LRR intensity for a single sample, overlaid with GC content. The LRR (green wave) is median-smoothed at a resolution suited to highlighting the wave fluctuations, which results in a narrower y-axis range than in the previous LRR plots. The black line above the LRR observations is the GC percentage, calculated for each 1-megabase window. Each significant GC peak corresponds with a major LRR peak. To preserve sample anonymity, the sample ID shown is a fake ID for display purposes only and is not connectable to any record.

Supplementary Table 2b. Summary of T1D cases and unaffected in families that failed one or more QC metrics and of the total number of samples submitted for burden of DEL tests.
Note: * Sample exclusions based on LRR QC tests are not independent measures, i.e. the total sample failure is not equivalent to the sum of all sample QC failures, as several samples failed more than one LRR QC test; ** additional exclusions were made based on sex mismatch, sample duplicates and non-Caucasian ancestry; ^ the plate QC test excluded 7 plates for which the sample failure percentage was above 40%. Note that because the family dataset was larger than can be handled by a SnpMatrix object (R snpStats package), QC was done in plumbCNV via Plink, which did not provide separate counts for HWE/call-rate and did not provide sample-heterozygosity calculations; this is why the information in the SNP-QC rows differs slightly from the case–control version of the same table.

Note: * Samples were excluded that had too many overall CNVs (i.e. CNVs + CNPs), because this is an indicator of poor data quality. The 'total CNV' column shows only sample and CNV counts of excluded rare CNVs. The thresholds for rDUP exclusion are more lax than for rDEL exclusion, as validation of the same samples in a different dataset (Supplementary Methods section 5) showed these to be optimal thresholds. Note that the length-based sample exclusion is only used for overall burden and gene/exon overlap tests, not for analysis of length. 'Stringent' sample QC comprised exclusion of outliers for LRR mean, LRR DLRS and GC wave at the lower- and upper-bound thresholds, stringent exclusion of chromosomal aberrations, and exclusion of plates with more than 30% of samples excluded on these former criteria. 'Medium' sample QC used 3.5 standard deviations as the threshold for LRR mean, DLRS and GC wave, used a more relaxed threshold for chromosomal aberrations, and excluded plates with more than 40% of samples excluded.
'Complete' SNP-QC involved exclusion of samples for low call rate, exclusion of SNPs with call rate below 99%, exclusion of SNPs violating a HWE threshold of 10^-4, exclusion of SNPs with significant differences in missingness or HWE between cohorts, and exclusion of samples with extreme heterozygosity. 'Medium' SNP-QC used a SNP call rate of 95%, a HWE threshold of 10^-10, and did not apply the exclusion for cohort differences.

Supplementary
There are several contexts to consider for expected sensitivity and specificity. Firstly, as a 'gold standard', results from the birdsuite validation [3] showed sensitivity ranging from ~30% to ~60% in the range of 6-20 probes per CNV, and this significantly outperformed other methods from 2008. Another group found roughly 67% specificity with 50% sensitivity using Illumina SNP arrays, but this included common CNVs [21]. Notably, if sensitivity varies substantially, then the 'rare' frequency cutoff could artificially reduce sensitivity: detection of CNVs at a rate below the true population frequency could readily occur in the present study, and may have occurred in some DGV studies. It may be that many CNVs found previously are so rare that we are unlikely to find them. The expected distribution for the population frequency of CNVs present within an individual is thought to be an exponentially decreasing function, implying that within an individual most CNVs are relatively common, but within a population most CNVs are rare, including CNVs unique to certain persons. Furthermore, after a recent update to the DGV it is very difficult to filter studies by the ethnicities included, so some DGV CNVs could be specific to non-European populations, making them difficult to detect in the current study of European samples. Lastly, QC for CNVs is still being developed, so it is highly likely that the DGV itself contains many false positives and false negatives. Given the performance of birdsuite, 60% was the absolute upper level of sensitivity expected, although considering the limitations discussed we thought it more likely to achieve half this accuracy.