Mapping eQTLs With RNA-Seq Reveals Novel SLE Susceptibility Genes, Non-Coding RNAs, and Alternative-Splicing Events That Are Concealed Using Microarrays

Studies attempting to functionally interpret complex-disease susceptibility loci by GWAS and eQTL integration have predominantly employed microarrays to quantify gene-expression. RNA-Seq has the potential to discover a more comprehensive set of eQTLs and illuminate the underlying molecular consequence. We examine the functional outcome of 39 variants associated with Systemic Lupus Erythematosus (SLE) through integration of GWAS and eQTL data from the TwinsUK microarray and RNA-Seq cohort in lymphoblastoid cell lines. We use conditional analysis and a Bayesian colocalisation method to provide evidence of a shared causal-variant, then compare the ability of each quantification type to detect disease relevant eQTLs and eGenes. We discovered a greater frequency of candidate-causal eQTLs using RNA-Seq, and identified novel SLE susceptibility genes that were concealed using microarrays (e.g. NADSYN1, SKP1, and TCF7). Many of these eQTLs were found to influence the expression of several genes, suggesting risk haplotypes may harbour multiple functional effects. We pinpointed eQTLs modulating expression of four non-coding RNAs; three of which were replicated in whole-blood. Novel SLE associated splicing events were identified in the T-reg restricted transcription factor, IKZF2, the autophagy-related gene WDFY4, and the redox coenzyme NADSYN1, through asQTL mapping using the Geuvadis cohort. We have significantly increased our understanding of the genetic control of gene-expression in SLE by maximising the leverage of RNA-Seq and performing integrative GWAS-eQTL analysis against gene, exon, and splice-junction quantifications. In doing so, we have identified novel SLE candidate genes and specific molecular mechanisms that will serve as the basis for targeted follow-up studies.


22
Studies attempting to functionally interpret complex-disease susceptibility loci by GWAS and 23 eQTL integration have predominantly employed microarrays to quantify gene-expression. 24 RNA-Seq has the potential to discover a more comprehensive set of eQTLs and illuminate 25 the underlying molecular consequence. We examine the functional outcome of 39 variants 26 associated with Systemic Lupus Erythematosus (SLE) through integration of GWAS and 27 eQTL data from the TwinsUK microarray and RNA-Seq cohort in lymphoblastoid cell lines. 28 We use conditional analysis and a Bayesian colocalisation method to provide evidence of a 29 shared causal-variant, then compare the ability of each quantification type to detect disease 30 relevant eQTLs and eGenes. We discovered a greater frequency of candidate-causal eQTLs 31 using RNA-Seq, and identified novel SLE susceptibility genes that were concealed using 32 microarrays (e.g. NADSYN1, SKP1, and TCF7). Many of these eQTLs were found to 33 influence the expression of several genes, suggesting risk haplotypes may harbour multiple 34 functional effects. We pinpointed eQTLs modulating expression of four non-coding RNAs; 35 three of which were replicated in whole-blood. Novel SLE associated splicing events were 36 identified in the T-reg restricted transcription factor, IKZF2, the autophagy-related gene 37 WDFY4, and the redox coenzyme NADSYN1, through asQTL mapping using the Geuvadis 38 cohort. We have significantly increased our understanding of the genetic control of gene- Integrative studies using RNA-Seq to functionally annotate complex-disease susceptibility 66 loci however have been limited (35,(40)(41)(42)(43)(44). Moreover, numerous investigations have aimed 67 to explain the functional relevance of susceptibility loci by interrogation of GWAS SNPs 68 themselves in eQTL datasets and simply testing for association with gene expression (45- 69 47). Such inferential observations should be treated with caution as they may possibly be the 70 result of coincidental overlap between disease association and eQTL signal due to local LD 71 and general ubiquity of regulatory variants (48). This has become particularly important as 72 statistical power in eQTL cohorts grow and availability of summary-level data accession 73 through eQTL data-browsers increases (49)(50)(51). 74 75 In this investigation, we integrate eQTL data derived from both microarray and RNA-Seq 76 experiments with our GWAS results in Systemic Lupus Erythematosus (SLE [MIM: 152700]); 77 a heritable autoimmune disease with undefined aetiology and over 50 genetically associated 78 loci (52-54). We use summary-level cis-eQTL results in lymphoblastoid cell lines (LCLs) 79 taken from the TwinsUK cohort to directly compare the microarray (9) and RNA-Seq (39) 80 results in detecting SLE associated eQTLs along with their accompanying eGenes. We 81 apply a rigorous two-step approach -a combination of conditional (55) and Bayesian 82 colocalisation (56) analysis -to test for a shared causal variant at each locus. We 83 demonstrate the benefits of using RNA-Seq over microarrays in eQTL analysis by identifying 84 not only novel SLE candidate-causal eGenes but also putative molecular mechanisms by 85 which SLE-associated SNPs may act; including differential exon usage, and expression 86 modulation of non-coding RNA. Our investigation was extended to include RNA-Seq 87 expression data in whole blood in order to validate the eQTL signals detected in LCLs and 88 uncover the differences in genetic control of expression between cell-types. Finally, we 89 interrogate the Geuvadis RNA-Seq cohort (35) to identify SLE associated alternative-splicing 90 quantitative trait loci (asQTLs) and highlight the advantages of profiling at various resolutions 91 to detect eQTLs that would otherwise remain concealed. Through functional annotation of 92 SLE associated loci using microarray and RNA-Seq derived expression data, we have 93 supplied comprehensive evidence of the need to use RNA-Seq to detect disease 94 contributing eQTLs and, in doing so, have suggested novel functional mechanisms that 95 serve as a basis for future targeted follow-up studies. 96

97
Discovery and classification of SLE candidate-causal eQTLs and eGenes 98 The first part of this study integrated the 39 SLE associated SNPs taken from a recent 99 GWAS in Europeans ( Table 1) with eQTLs from the TwinsUK gene-expression cohort 100 (n=856) profiled using microarray and RNA-Seq (at both gene-level and exon-level 101 resolutions). To accomplish this, we implemented a two-step pipeline (Fig. 1), and subjected 102 the genomic intervals within +/-1Mb of each of the 39 GWAS SNPs to eQTL association 103 analysis against expression quantifications in LCLs ( Table 2).

105
Full results of the conditional and colocalisation analysis for each significant association are 106 presented in S1 Table,  RNA-Seq (exon-level), respectively. Statistically significant SLE-associated cis-eQTLs 108 showing evidence of a shared causal variant or in very strong LD and close ranking between 109 the disease and cis-eQTL signal following conditional and colocalisation analyses were 110 classified as SLE candidate-causal eQTLs as stated in Methods. SLE candidate-causal 111 eGenes were defined as genes whose expression is modulated by the eQTL. The final 112 column of S1-S3 Tables indicates whether each GWAS SNP is deemed to be candidate-113 causal. These SLE candidate-causal eQTLs and eGenes are presented in separate tables 114 contingent on the dataset from which they were generated: results from microarray 115 assessment are listed in Table 3, from RNA-Seq (gene-level) in Table 4, and from RNA-Seq 116 (exon-level) in Table 5. Effect sizes are with respect to the minor allele; risk alleles are 117 highlighted in Table 1. Overall, exon-level analysis was the most effective quantification type 118 for the discovery of eQTLs and eGenes compared with gene-level RNA-Seq or microarray 119 analysis following an FDR cut-off of q < 0.05 and conditional and colocalisation thresholding 120 as described.   Breakdown of genotype-expression (eQTL) cohorts used in analysis. TwinsUK cohort in lymphoblastoid cell lines (LCLs) used for microarray and RNA-Seq comparison (profiled at gene and metaexon resolution); meta-exons are described as non-redundant overlapping portions of exons generated flattening of the transcriptome annotation. All TwinsUK (MuTHER) samples used in analysis are derived from the original 856 individuals. Validation of LCL data in whole blood carried out at meta-exon level using 384 of the 856 individuals. Geuvadis cohort used for asQTL identification; splice-junction quantifications were generated by Altrans (57) from the raw sequence alignments. Summary eQTL results include only the eQTL association results per test (where full genotype and expression data were not obtainable).
123   Fig. and S2 Fig.). A total of 14 eQTLs modulating expression of 34 eGenes were detected 140 using exon-level RNA-Seq contrasted to 11 eQTLs and 19 eGenes at gene-level RNA-Seq 141 and only 8 eQTLs with 12 eGenes identified using microarray (S1 Fig. and S2 Fig.). Only 142 one eQTL (rs2286672) and two eGenes (PDHB, INCA1) were found by microarray 143 exclusively. These associations were either not significant post multiple testing using either 144 RNA-Seq method, or were not deemed candidate-causal (S1 Table S1 and S3 Table). In 145 total 8 eQTLs regulating expression of 27 eGenes were detected using RNA-Seq but missed 146 using microarray (S3 Fig. S3 and S4 Fig.). Interestingly, exon-level analysis led to the 147 greatest frequency of non-candidate-causal associations. Only 14 of the 34 significant 148 eQTLs, q < 0.05, showed evidence of a shared causal variant post conditional and 149 colocalisation testing (S1 Fig. and S2 Fig.). One example of this effect is at the TCF7-SKP1 locus where the disease-association signal 166 encompasses both genes (S6 Fig.). Both TCF7 and SKP1 were classified as candidate-167 causal against the GWAS SNP rs7726414 using gene-level RNA-Seq ( Other examples where either gene-or exon-level RNA-Seq analysis identified multiple 186 eGenes are at rs3024505 (IL10, IL24, and FCAMR), rs12802200 and rs3794060 (Fig. 2). In 187 the immune-gene concentrated 11p15.5 region (S7 Fig.), the GWAS SNP rs12802200, is an  Table 6); none of which were captured using microarray. There is an rs3794060 allele-dependent expression modulation of both exons of another 254 non-coding eGene, RP11-660L16.2I (Fig. 4C), which is located in the bi-directional promoter 255 between DHCR7 and NADSYN1. The best eQTL for those exons is highly correlated with  Table   259 4).
260 To validate our LCL findings in a primary tissue-type, we extended our analytical pipeline to 264 include an exon-level RNA-Seq dataset in 384 whole-blood samples from the TwinsUK 265 cohort ( Table 2). The full results of these analyses are provided in S4 Table. 266 267 We observed good correlation between LCLs and whole-blood effect-sizes (β) of GWAS 268 SNPs against all matched cis exon-level associations (R 2 =0.74; S10 Fig.). whole-blood cell types (S11 Fig.). The remaining four eGenes specific to whole-blood were: 273 PXK (rs9311676); IRF7 and TALDO1 (rs12802200); and SCAMP2 (rs2289583) ( Table 7). 274 Interestingly, the eQTLs regulating these four eGenes in whole-blood also regulated multiple 275 eGenes in LCLs (S11 Fig.), implying that they tag highly regulatory haplotypes that may (57) (S12 Fig.). In whole-blood the GWAS variant, rs10028805, is associated with altered 289 expression of exon 2 (P=8.4x10 -05 ), with the best cis-eQTL for this effect being in near-290 perfect LD (rs4411998; r 2 :0.98). Both rs10028805 and rs4411998 are in strong LD with the 291 branch-point SNP (r 2 :0.9). In LCLs however, the best cis-eQTL for exon 2, rs4572885 292 (P=9.74 x10 -23 ), has a large effect but is less correlated with the GWAS SNP (r 2 :0.65) and 293 conditional analysis judges the effect of the GWAS SNP to be independent to the best cis-294 eQTL for exon 2 (S3 Table). Interestingly, there is low correlation between the branch-point 295 SNP rs17266594 and the best cis-eQTL for exon 2 in LCLs (r 2 :0.42); suggesting the 296 regulatory mechanism of exon 2 splicing in BANK1 may be under two separate genetic 297 influences between the two cell-groups (S12 Fig.).

299
We saw a near identical pattern of differential exon usage within eGene NADSYN1 between 300 LCLs and whole-blood driven by the GWAS SNP rs37940460 or tightly correlated variants 301 (S13 Fig., Table 7). Variation at rs37940460 appeared to drive extensive expression 302 disruption of two meta-exons (11 and 12) of NADSYN1 located near the centre of the gene 303 (meta-exon 11: LCL P=1.79x10 -60 ; whole-blood P=1.28x10 -27 ; meta-exon 12: LCL 304 P=1.06x10 -58 ; whole-blood P=6.30x10 -26 ). These two meta-exons were deemed to be 305 candidate-causal for SLE across both cell-types. We believe the meta-exons in the 3′ end of 306 NADSYN1 that are candidate-causal in LCLs (Fig. 3), are not detected in whole-blood may 307 be because of the smaller sample size or the mixed cell-type composition of the whole blood 308 cohort. This novel instance of specific exon expression disruption found in a primary cell-309 type at NADSYN1 may help to resolve the functional consequence of this locus. 310 GWAS SNPs deemed to be candidate causal eQTLs using RNA-Seq expression data profiled from 384 individuals of the TwinsUK cohort in whole blood at exon-level resolution. In total, 3,793 exons were tested against, corresponding to 654 genes.

312
Splice-junction quantification reveals asQTLs and additional candidate eGenes 313 We extended our investigation to determine whether the GWAS SNPs (Table 1) had a direct   314 influence on the alternative-splicing of transcripts (alternative-splicing quantitative trait loci; 315 asQTL), and whether expression quantification at this resolution would reveal any additional 316 candidate-genes or potential functional mechanisms. We undertook cis-asQTL analysis 317 within a +/-1Mb window around each GWAS SNP against 33,039 splice-junction 318 quantifications, corresponding to 817 genes, in the Geuvadis cohort ( Table 2). We identified 319 nine asQTLs significantly associated with 62 splice-junctions, corresponding to 10 eGenes 320 (S5 Table). After testing for a shared causal variant between the GWAS and asQTL signal, 321 six SLE candidate-causal asQTLs (26 splice-junctions) for seven eGenes remained ( amino-acids) of IKZF2 (Fig. 5A). Interestingly, this isoform possesses a premature 334 termination codon found on exon 6B that is not found on the canonical isoform 335 (ENST00000457361, 526 amino-acids) as in this isoform, exon 6A is spliced to exon 7 (Fig.   336   5B). This effect results in the premature truncation of the full-length protein and the 337 subsequent loss of the two zinc-finger dimerization domains found on exon 8 (Fig. 5B). The GWAS SNP and known asQTL, rs3757387, causes differential promoter usage of IRF5 371 (Interferon regulatory factor 5); a molecular mechanism that has previously been reported in  Finally, using splice-junction quantification, we were able to pinpoint the specific transcript of 382 NADSYN1 that drives the exon-level association previously described (Table 5, Fig. 3). The 383 GWAS SNP rs3794060 leads to substantial upregulation of the meta-exon 10 to meta-exon 384 12 splice-site (P=8.0x10 -12 ) which is unique the ENST00000528509 transcript of NADSYN1 385 (Table 8, S14 Fig.). As a consequence of this splicing event, it appears the meta-exon 11 to 386 meta-exon 12 splice-site is highly reduced (P=2.1x10 -14 ) with reference to the risk allele [C].  revealed that only 13 of the 39 risk alleles were eQTLs for a total of 15 eGenes (Table S6). 419 Therefore, in an attempt to increase our understanding of the extent to which eQTLs can 420 explain the functional consequences of our risk alleles and to achieve both of our major aims  Table). Many of the published SLE candidate eGenes and associated 435 mechanisms were well-replicated by performing cis-eQTL analysis of RNA-Seq datasets at 436 various resolutions for each of the GWAS variants. These eGenes included, among others, 437 the effect of risk variants at BLK and FAM167A (S9 Fig.), MIR146A (Fig. 4A), BANK1 (S12 438 Fig.) (Fig. 2).

443
Our results also demonstrate that RNA-Seq analysis is much better than microarrays in 444 identifying multiple eGenes for a single SNP (that may tag multiple functional variants). An 445 increased ratio of eQTLs to eGenes (average number of eGenes per candidate-causal 446 eQTL) was observed using RNA-Seq at exon-level (2.42) compared with gene-level (1.72) 447 and which were both greater than microarray (1.5) (S5 Fig.). The ability of RNA-Seq exon- , which we showed to be associated with 480 SLE (Fig. 6). These two potentially causal signals may reinforce each other. The novel 481 association identified in our group's recent GWAS study (63) at the IKZF2 locus implicated a 482 risk haplotype tagged by the risk allele of rs3768792[G]. We identified, by RNA-Seq splice-483 junction quantification exclusively, that variation at the rs3768792[G] risk allele led to 484 increased production of a shorter isoform of IKZF2 (Fig. 5) Our study also demonstrates that RNA-Seq is able to identify disease-relevant non-coding 499 RNAs. These type of transcripts have long been known to be of relevance to human 500 disease, however their detection and functional importance may have been under-estimated. 501 We identified three novel non-coding anti-sense RNAs using RNA-Seq: two regulated by 502 rs2736340 (RP11-148O21.2 and RP11-148O21.4) (Fig. 4B), and one regulated by 503 rs3794060 (RP11-660L16.2) (Fig. 4C). These findings were replicated in whole blood (Table   504   7).
The data we present in this manuscript demonstrate that a comprehensive integrated 507 approach for eQTL analysis should be undertaken at gene-, exon-and exon-junction level 508 quantification. Excluding one or more levels of analysis will mean that eQTLs and/or eGenes 509 may be missed. This is illustrated by a number of examples in Fig. 7. At some loci, there 510 was an exon-level effect, which was not observed at the gene-level. For example, the risk 511 allele rs3024505[A] was an eQTL for IL10, IL24, and FCAMR at exon-level resolution only 512 ( Hierarchical clustering tests could be designed to distinguish between these genuine gene-518 level and exon-level effects, such as at NADSYN1 or BANK1, where the gene-level effect is 519 likely to be driven by only a subset of exons (Fig. 3, S12 Fig.). There were occurrences 520 where significant candidate-causal eGenes were detected but there was no effect at the 521 exon-level. At TCF7, variants in low LD with rs7726414 exhibited significant exon-level cis-522 eQTLs (S3 Table) and were deemed to be independent of the disease-association. 523 However, gene-level analysis revealed that the risk rs7726414 variant was candidate-causal 524 for total TCF7 expression (S2 Table). These results emphasise that for any given eGene 525 there may be multiple genetic effects at different resolutions of quantification.  Definition of SLE candidate-causal cis-eQTL and eGene 647 We defined a GWAS SNP as an SLE candidate-causal cis-eQTL if it met the following 648 criteria: significant post-multiple testing adjustment (q < 0.05), not independent to the best 649 cis-eQTL from conditional analysis (P cond > 0.05), and supporting evidence of a shared 650 causal variant between gene expression and the primary GWAS signal based on 651 colocalisation (PP3 < PP4). The gene whose expression is modulated by the candidate-652 causal eQTL is defined as an SLE candidate-causal eGene (Fig. 1).

654
Validation of LCL SLE candidate-causal cis-eQTLs in whole blood 655 Cis-eQTL summary data from whole blood at RNA-Seq exon-level were made available for 656 384 individuals of the 856 TwinsUK cohort individuals ( Table 2). Expression profiling and 657 genotyping were identical to that as described for LCLs. We applied the same methodology 658 to this dataset to generate full cis-eQTL summary statistics, perform conditional and 659 colocalisation analysis, and classify SLE candidate-causal eQTLs and associated eGenes 660 (Fig. 1). In total, 3,793 exons were tested against, corresponding to 654 genes.

662
Geuvadis SLE candidate-causal cis-asQTL analysis 663 We investigated SLE disease-associated alternative splicing QTLs (asQTLs) using 664 European samples from the raw alignment files of the Geuvadis (35) 1000 Genomes RNA- 665 Seq project profiled in LCLs ( reads, and reads with more than eight mismatches for read and mate using Samtools(95). 673 We used the Altrans (96)  The first column is the key showing the relative P value of the eQTLs within each platform. 1120 For the platform-specific columns, if an eQTL-eGene association is candidate-causal in at 1121 least one quantification type, the data is displayed across all platforms. Rows are or ordered 1122 by decreasing cumulative significance across quantification types. To normalize across 1123 quantification types, relative significance of each association per column was calculated as 1124 the -log 2 (P/P max ); where P max is the most significant association per quantification type. If an 1125 association is deemed to be candidate-causal within a particular profiling-type, it is 1126 highlighted with an asterisk. 1127