CNV analysis in a large schizophrenia sample implicates deletions at 16p12.1 and SLC1A1 and duplications at 1p36.33 and CGNL1

Large and rare copy number variants (CNVs) at several loci have been shown to increase risk for schizophrenia. Aiming to discover novel susceptibility CNV loci, we analyzed 6882 cases and 11 255 controls genotyped on Illumina arrays, most of which have not been used for this purpose before. We identified genes enriched for rare exonic CNVs among cases, and then attempted to replicate the findings in an additional 14 568 cases and 15 274 controls. In a combined analysis of all samples, 12 distinct loci were enriched among cases with nominal levels of significance (P < 0.05); however, none would survive correction for multiple testing. These loci include recurrent deletions at 16p12.1, a locus previously associated with neurodevelopmental disorders (P = 0.0084 in the discovery sample and P = 0.023 in the replication sample). Other plausible candidates include non-recurrent deletions at the glutamate transporter gene SLC1A1, a CNV locus recently suggested to be involved in schizophrenia through linkage analysis, and duplications at 1p36.33 and CGNL1. A burden analysis of large (>500 kb), rare CNVs showed a 1.2% excess in cases after excluding known schizophrenia-associated loci, suggesting that additional susceptibility loci exist. However, even larger samples are required for their discovery.


Sample Description
Discovery sample: The 7129 discovery cases came from samples we call the CLOZUK (n=6,558) and CardiffCOGS (n=571) series which have in part been described previously (1-3). The CLOZUK sample consists of patients taking clozapine, who provide regular blood samples to allow early detection of adverse effects of that treatment. Through collaboration with Novartis, the manufacturer of a proprietary form of clozapine (Clozaril), we acquired blood from people with schizophrenia who were taking the drug via the central processing labs of a clozapine blood monitoring service. After the samples had been used to complete the necessary laboratory tests, unused fractions were sent to Tepnel Life Sciences (Paisley, UK) for DNA extraction. Samples were anonymous, only basic demographic and diagnostic details being made available. Subjects (71% male) were UK residents, aged 18-90 with a recorded diagnosis of treatment resistant schizophrenia according to the clozapine registration forms completed by the treating psychiatrists. In the UK, treatment resistant schizophrenia implies a lack of satisfactory clinical improvement to adequate trials of at least two other antipsychotics. Approval by the local ethics committee was granted for the use of these samples in genetic association studies.
The CardiffCOGS is a sample of clinically diagnosed schizophrenic patients from the UK.
Interviews with the SCAN instrument (4) and case note review were used to arrive at a best-estimate lifetime diagnosis according to DSM-IV criteria (5).
All cases were genotyped on either HumanOmniExpress-12v1 or HumanOmniExpressExome-8v1 arrays at the Broad Institute, Cambridge, Massachusetts.
All controls for the discovery sample were downloaded with the relevant approvals for our study from the online repositories Database of Genotypes and Phenotypes (dbGaP) and the European Genome-Phenome Archive (EGA). The four non-psychiatric control datasets obtained, totalling 12,080 samples, are summarised in Table S1. We purposefully selected datasets that were genotyped on high density Illumina arrays to maximise probe overlap with the Illumina arrays used to genotype cases.

[Table S1: control datasets (columns: Dataset, Source, ...); table body not recovered.]

[...] them are of mixed or Indian origin, and they were classed as "others" in our analysis.
Irish sample: Details of these samples have been published previously (9). WTCCC2 samples that overlapped with our discovery sample were excluded. Calls in the Irish schizophrenia sample were created for autosomes using Birdseye from Birdsuite (version 1.5.5) (7), and we excluded calls with lengths <100kb or >10Mb, or with a LOD score <10.

Replication samples
We excluded CNVs with at least 50% overlap with other regional CNVs present in 1% or more of the samples. We excluded individuals with >30 CNV calls, or a total CNV length >10Mbp. Calls from plates containing fewer than 40 samples were also excluded.
[...] deviation from Hardy-Weinberg equilibrium (P < 10⁻⁶ in controls or P < 10⁻¹⁰ in cases).
The Birdseye tool in Birdsuite (7) was applied to intensity data from SNP and CNV probes.
The Birdseye algorithm uses a hidden Markov model (HMM) approach to find regions of variable copy number in a sample. Model priors were generated for each genotyping platform. All genomic positions were mapped to the hg19 coordinates.
A multi-step quality control (QC) procedure was implemented in order to assemble a high-quality rare CNV callset. Samples were excluded if they failed SNP QC or if they had > 40 CNV calls or > 10Mb of CNVs (8). CNVs were excluded if they were of low confidence (LOD <10, size < 20kb, or spanning < 10 probes) or if they overlapped large genomic gaps (≥1kb overlap). Any CNVs that appeared to be artificially split by the HMM were annealed. Next, we imposed a 1% frequency threshold by removing any CNV with > 50% of its length spanning a region with CNVs from >1% of total samples, as implemented in PLINK (24). After [...]
Clinicians reviewed diagnoses that were based on DSM-IV (5). Cases were included in the current study if they met criteria for schizophrenia or schizoaffective disorder. Individuals without a personal or family history of psychosis or mania were eligible to participate as controls. In the current study, we genotyped samples from the GPC cohort members with self-reported African American ancestry. Controls were asked to complete a questionnaire to assess their psychiatric history and the psychiatric history of all first-degree relatives.
Individuals reporting any lifetime symptoms indicative of psychosis or mania were excluded as control participants. CNVs were called on all samples using PennCNV and NCBI37/hg19 coordinates. The following samples were removed: duplicate individuals; first-degree relatives (where phenotypes were discordant, the control was always removed); individuals with more than 2% missing genotypes; individuals with more than 60% European ancestry; and individuals with more than 10Mbp of the genome estimated as CNV. After quality control, CNVs were analysed in 1,637 cases and 960 controls.
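The sample-level exclusions listed above can be sketched in a few lines of Python. This is an illustrative sketch only, not the pipeline used in the study: the `SampleQC` record, its field names and both function names are hypothetical, and the rule for relative pairs with the same phenotype is an assumption, since the text only specifies what happens to discordant pairs.

```python
from dataclasses import dataclass


@dataclass
class SampleQC:
    """Hypothetical per-sample QC summary (field names are illustrative)."""
    sample_id: str
    is_case: bool
    missing_rate: float       # fraction of genotypes missing
    european_ancestry: float  # estimated European ancestry fraction
    cnv_bp: int               # total base pairs called as CNV


def passes_thresholds(s: SampleQC) -> bool:
    """Apply the three numeric exclusions stated in the text."""
    return (
        s.missing_rate <= 0.02           # no more than 2% missing genotypes
        and s.european_ancestry <= 0.60  # no more than 60% European ancestry
        and s.cnv_bp <= 10_000_000       # no more than 10 Mbp called as CNV
    )


def drop_from_relative_pair(a: SampleQC, b: SampleQC) -> str:
    """Return the sample_id to remove from a first-degree relative pair."""
    if a.is_case != b.is_case:
        # Discordant phenotypes: the control is always removed (from the text).
        return b.sample_id if a.is_case else a.sample_id
    # ASSUMPTION (not in the source): for concordant pairs, drop the sample
    # with the higher missing-genotype rate.
    return a.sample_id if a.missing_rate > b.missing_rate else b.sample_id
```

Keeping the thresholds in one predicate makes it straightforward to report how many samples each criterion removes, which is useful when comparing QC across genotyping batches.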

Discovery sample quality control
Raw intensity data from each case/control dataset were independently processed and analysed to account for potential batch effects. Log R ratios (LRR) and B-allele frequencies were generated using Illumina GenomeStudio software (v2011.1). CNVs were called using the PennCNV calling algorithm, following the standard protocol and adjusting for GC content.
CNVs were called using the 520,766 probes common to all discovery arrays to avoid a cross-platform CNV locus detection bias. Samples were excluded if they were an outlier in their source dataset for any one of the following QC metrics: LRR standard deviation, B-allele frequency drift, wave factor and total number of CNVs called per person. Following the exclusion of poorly performing samples, we performed quality control on the called CNVs. Firstly, CNVs in the same individual were joined if the distance separating them was less than 50% of their combined length, using a custom-developed open source programme (http://x004.psycm.uwcm.ac.uk/~dobril/combine_CNVs/). All CNVs were then excluded if they were covered by fewer than 10 probes, were less than 10kb in length, overlapped with low copy repeats by more than 50% of their length, or had sparse probe coverage (the size of the CNV divided by the number of probes covering it exceeded 20kb per probe, i.e. a density below 1 probe/20kb). CNV loci with a frequency > 1% of the total discovery sample were excluded using PLINK (24).
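The joining and CNV-level filtering steps above can be illustrated with a minimal Python sketch. It is not the combine_CNVs programme referred to in the text. The `CNV` record and function names are hypothetical, and three points are assumptions: "combined length" is taken as the sum of the two call lengths, only calls of the same type on the same chromosome are joined, and the probe count of a joined call is approximated by the sum of its parts (probes falling in the gap are ignored).

```python
from dataclasses import dataclass


@dataclass
class CNV:
    """Hypothetical CNV call record (field names are illustrative)."""
    sample: str
    chrom: str
    start: int     # bp, inclusive
    end: int       # bp, inclusive
    n_probes: int
    cn_type: str   # "del" or "dup"

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def merge_adjacent(cnvs):
    """Join neighbouring calls when the gap is < 50% of their combined length."""
    merged = []
    key = lambda c: (c.sample, c.chrom, c.cn_type, c.start)
    for cnv in sorted(cnvs, key=key):
        prev = merged[-1] if merged else None
        same_group = prev is not None and (
            (prev.sample, prev.chrom, prev.cn_type)
            == (cnv.sample, cnv.chrom, cnv.cn_type)
        )
        if same_group:
            gap = max(0, cnv.start - prev.end - 1)
            # ASSUMPTION: combined length = sum of the two call lengths.
            if gap < 0.5 * (prev.length + cnv.length):
                merged[-1] = CNV(
                    prev.sample, prev.chrom, prev.start,
                    max(prev.end, cnv.end),
                    prev.n_probes + cnv.n_probes,  # approximation, see lead-in
                    prev.cn_type,
                )
                continue
        merged.append(cnv)
    return merged


def passes_filters(cnv: CNV, lcr_overlap_fraction: float = 0.0) -> bool:
    """Apply the probe-count, size, low-copy-repeat and probe-spacing filters."""
    if cnv.n_probes < 10 or cnv.length < 10_000:
        return False
    if lcr_overlap_fraction > 0.5:
        return False
    # Size divided by probe count must not exceed 20 kb per probe.
    return cnv.length / cnv.n_probes <= 20_000
```

For example, two deletions of 50 kb and 40 kb in the same individual separated by a 10 kb gap would be joined (10 kb is less than half of 90 kb), whereas a third call several hundred kb away would remain separate.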
The remaining rare CNVs were required to pass a median Z-score outlier method of validation. This method is detailed in Kirov et al, 2012 (25) and Rees et al (3). Briefly, each probe intensity within an individual is converted to a Z-score by standardising it first across all probes within that individual and then, for that probe, across all individuals. These rounds of standardisation help reduce noise created by natural fluctuations in probe intensity. A median Z-score value for all probes within a putative CNV region is used to assess copy number, with true deletions and duplications represented as outliers in the distribution of median Z-scores across samples. Each CNV in every individual was assigned a Z-score. CNVs with Z-scores of <-6 were accepted as true deletions, while those with Z-scores of >+3 were accepted as duplications. The Z-score histograms of CNVs with marginal Z-scores (deletion Z-score between -4 and -6 and duplication Z-score between +2 and +3) were manually inspected, and for those whose Z-scores remained ambiguous the LRR and B-allele frequencies were visually inspected with the Illumina GenomeStudio v2011.1 software. This resulted in 2,569 CNVs being filtered out from the data.
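The two rounds of standardisation and the threshold rule described above might be sketched as follows. This is a simplified illustration under stated assumptions, not the published implementation (see references 3 and 25 for that). The function names are hypothetical, the population standard deviation is an arbitrary choice, and the "unsupported" label for values between -4 and +2 is an assumption, because the text does not say how such calls were handled.

```python
from statistics import mean, median, pstdev


def standardise(values):
    """Convert a list of numbers to Z-scores (population SD; an assumption)."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd if sd > 0 else 0.0 for v in values]


def double_standardise(intensity):
    """intensity[i][j] is the intensity of probe j in individual i."""
    # Round 1: within each individual, across all probes.
    within = [standardise(row) for row in intensity]
    # Round 2: for each probe, across all individuals.
    per_probe = [standardise(list(col)) for col in zip(*within)]
    # Transpose back to individuals x probes.
    return [list(row) for row in zip(*per_probe)]


def median_z(z, probe_idx):
    """Median Z-score per individual over the probes in a putative CNV region."""
    return [median(row[j] for j in probe_idx) for row in z]


def classify(med_z):
    """Map a median Z-score to a call using the thresholds given in the text."""
    if med_z < -6:
        return "deletion"
    if med_z > 3:
        return "duplication"
    if med_z <= -4 or med_z >= 2:
        return "inspect"       # marginal: flagged for manual inspection
    return "unsupported"       # ASSUMPTION: label not defined in the source
```

Taking the median over the probes in a region makes the score robust to a few noisy probes, so a true CNV shifts the whole region while isolated probe artefacts have little effect.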

Previously implicated CNV regions
To identify novel CNV associations, we excluded from our analysis regions previously implicated in schizophrenia. These regions, along with the number of CNVs observed in our discovery sample, are presented in Table S3. Some or all of the 4,939 WTCCC2 control samples have been included in previous reviews, or in papers that implicated these loci, so the numbers for the control populations are not entirely independent from previous reports.
The evidence that these loci are implicated in SZ is presented in Rees et al (3).

Table S4. Burden of large CNVs before and after removing implicated loci. CNVs are stratified by type (all, deletions only, duplications only) and size (500kb-1Mb and > 1Mb).

UCSC tracks of significant loci
Location of CNVs that remained significant with a combined discovery/replication Cochran-Mantel-Haenszel test ( Table 1 in