-
PDF
- Split View
-
Views
-
Cite
Cite
Tianzhong Yang, Lauren J Mills, Haoran Xue, Andrew Raduski, Lindsay A Williams, Logan G Spector, Impact of fetal expression quantitative trait loci on transcriptome-wide association study of childhood leukemia, Human Molecular Genetics, Volume 31, Issue 19, 1 October 2022, Pages 3207–3215, https://doi.org/10.1093/hmg/ddab336
- Share Icon Share
Abstract
Transcriptome-wide association studies increase the yield of loci associated with disease phenotypes by focusing on expression quantitative trait loci (eQTL). The major source of eQTL data for is the Gene and Tissue Expression (GTEx) project, which is comprised entirely of adults, mainly those >50 years of age at death. Since gene expression levels differ by developmental stage, it is not clear whether eQTLs derived from adult data sources are best suited for use in young-onset diseases such as pediatric cancers. To fill in this knowledge gap, we performed a large-scale eQTL mapping analysis in the GenCord study with newborn samples and compared it with GTEx. Under matched conditions, we found around 80% of the eQTLs in one study can be replicated in the other. However, among all eQTLs identified in GenCord (GTEx), 584 (1045) showed statistically significant differences in effect sizes in GTEx (GenCord). We further investigated how using fetal eQTL data can facilitate the genetic association study of acute lymphoblastic leukemia. GenCord and GTEx identified the same genetic loci with statistical significance; however, the overall association pattern was only weakly correlated. Our paper demonstrates age-differential eQTLs and shows their potential influence on childhood leukemia research.
Introduction
Pediatric cancers are rare. The incidence of all cancers occurring in children under 20 years of age is about 166 cases per million in the United States (1). Studying the genetic causes of pediatric cancers is very challenging due to the rarity of the disease; however, germline genetics are suspected to play an important role in pediatric cancer susceptibility (2) due to the short latency from birth to diagnosis. For over a decade, genome-wide association studies (GWAS) have been used to study genetic mechanisms of diseases or complex traits and have successfully identified tens of thousands of risk variants (3). A majority of GWAS loci are located in the non-coding regions whereas trait-associated single nucleotide polymorphisms (SNPs) are more likely to be expression quantitative trait loci (eQTLs), i.e. loci that explain variation in gene expression levels (4). Recently, transcriptome-wide association studies (TWAS), which integrate GWAS data by leveraging eQTLs, have increased the yield of genetic loci associated with disease phenotypes (5–7), albeit they have mostly been applied to adult traits.
In fact, the major source of eQTL data is from adult tissues. For example, the Gene and Tissue Expression (GTEx) project is comprised of deceased individuals, mainly those > 50 years of age at death. However, it is unclear whether such an eQTL data source is best suited for pediatric cancers. Gene expression levels differ by developmental stage (8). A recent study predicted 79% of the eQTLs in fetal brains shared with the adult brains and identified 172 putative genes with fetal-specific eQTLs (9). The lack of developmentally appropriate eQTL data may have hampered the adoption of the TWAS design in pediatric cancer studies as, to our knowledge, TWAS has only been utilized in one study on B-cell acute lymphoblastic leukemia (ALL). This study used eQTLs derived from GTEx data but did not discover any additional associations beyond these identified in standard GWAS (10). We were thus interested to learn whether developmentally appropriate sources of eQTL data are better for use in young-onset diseases such as B-cell ALL. To this end, we investigated the relationship between SNPs and gene expression in adult and newborn B-lymphoblastoid cell lines through eQTL mapping. Then, we explored TWAS with the adult and newborn eQTL sources to evaluate their performance in improving statistical power to identify potentially causal loci associated with B-cell ALL.

(A) distance of eQTLs to the corresponding eGenes identified in GTEx and GenCord (B) and the effect sizes of eQTLs in the two studies (FDR corresponds to the z-score test comparing the effect sizes).
Results
eQTL mapping comparison between GenCord and GTEx
Using an false discovery rate (FDR) < 0.05, we identified 6063 eGenes in GTEx and 5825 in GenCord. The overall effect sizes of eQTLs on gene expression were comparable for newborns and adults (Supplementary Material, Fig. S1). There was an overlap of 3357 eGenes and 502 eQTL-eGene pairs. The shared eQTL-eGene pairs had highly correlated effect sizes in the two studies (r = 0.97). The correlation was between 0.96 and 0.97 with varying FDR thresholds (Supplementary Material, Table S1). For both GenCord and GTEx, the majority eGenes were protein-coding genes as expected. Around half of the identified eQTLs in both studies were located within 20 kb upstream and downstream of the transcription start site (TSS) (Fig. 1A).
Replication of eQTLs identified in GenCord was high in GTEx (proportion of true positive π1 = 0.82). Similar sharing of eQTLs identified in GTEx was observed in GenCord (π1 = 0.81). Additionally, the log enrichment odds ratios in functional annotation had correlation as high as 0.98. Together, the results suggested substantial similarity of eQTL patterns in fetal and adult B-cell lymphocytes. Generally, smaller enrichment P-values of all types were observed in the blood tissue/lymphoblastoid cell line (GM12878) than the other tissues under evaluation (Supplementary Material, Fig. S2).
In blood tissues, eQTLs were positively enriched in TSS, transcription factor binding sites, open chromatin regions, chromatin states, all histone modification types except H3K27me3 (Supplementary Material, Tables S2 and S3), and were depleted in the repressed chromatin states (Supplementary Material, Tables S4 and S5).
On the other hand, a smaller proportion of eQTL-eGene pairs showed age-differential effect sizes (Fig. 1B). Five hundred and eighty-four eQTLs identified in GenCord had a difference of effect sizes in GTEx (Supplementary Material, Table S7) and 1045 eQTLs identified in GTEx showed different effect sizes in GenCord (Supplementary Material, Table S8). Among 1509 eGenes that harbored these eQTLs, 628 protein-coding genes were present in the Reactome database. These genes were significantly enriched in nine pathways by the binomial test (Supplementary Material, Table S9), including antigen presentation, interferon signaling and immunoregulatory interactions between lymphoid and non-lymphoid cell pathways.
We found some evidence that fetal eQTLs were enriched with variants associated with B-cell ALL. With the P-value threshold decreasing from 5 × 10−2 to 5 × 10−4, eQTLs were more strongly enriched in variants associated with B-cell ALL in the hypergeometric test (Supplementary Material, Table S6). However, the Wilcoxon rank-sum test directly comparing the rank of P-values of identified eQTLs and non-eQTLs was not statistically significant (P-values = 0.341).
TWAS comparison between GenCord and GTEx
With 147 samples in GTEx and GenCord, 4429 and 4299 gene models, respectively, had good power to predict the corresponding gene expression with a cross-validated correlation >0.1 and corresponding P-value < 0.05. However, only 2123 genes were shared between the two studies and different prediction performances were observed. For example, 973 out of 4299 genes whose prediction models based on GenCord had at least two times higher variance explained than these same genes based on GTEx. For gene models with good prediction power in both GenCord and GTEx, the prediction power was moderately correlated (R = 0.75). Using all samples of GenCord increased the number and sharing of well-predicted gene models to 5021 and 2264, respectively, suggesting that a mild increase in sample size of eQTL data can yield better prediction.
Figure 3 shows a Miami plot of TWAS results using GTEx and GenCord using 147 samples each. The two eQTL datasets identified the same genetic regions that reached genome-wide significance. Closely located to each other on chromosome 7, FIGNL1 and IKZF1 were identified using GenCord data while only FIGNL1 was identified using GTEx. FIGNL1 was a stronger signal than IKZF1, the more well-known risk gene associated with ALL (10, 11). The P-value of the genetically predicted expression of FIGNL1 was much smaller in GenCord than GTEx (Table 1). Interestingly, a SNP rs11980379 located in this region was reported to have a stronger effect on the risk of ALL for the younger age group in another small GWAS study (12), explaining why it was not discovered using GTEx but GenCord. In fact, in the multivariate TWAS analysis adjusting for the predicted gene expression levels of FIGNL1 and IKZF1, both genes were highly significant (FIGNL1 P-value = 5.9 × 10−16; IKZF1 P-value = 7.9 × 10−8). Another genetic risk region was located at chromosome 1p36, where a few different genes showed very small P-values (Table 2). One genotyped SNP rs1695824 was the key driving signal for the significance (OR: 1.78, P-value = 5.4 × 10−33) with minor allele frequency (MAF) 0.22, which was associated with the expression levels of several genes, for example, ANKRD65 in GenCord, and SLC35E2B in GTEx. This signal has not been previously reported to be associated with childhood ALL and thus we performed a sensitivity analysis with a different set of controls (9212 healthy samples from Geisinger study). The SNP rs1695824 was imputed with high quality (imputation R2 = 0.92) and still showed a very strong association with B-cell ALL with the same direction (P-value = 1.6 × 10−67). We thus believed this SNP warrants further investigation. Surprisingly, beyond the top two signals, the overall pattern of the TWAS results was quite different between the GenCord and GTEx eQTL datasets. The Pearson correlation of the logarithm of odds ratios for genes shared by GenCord and GTEx was only 0.26. Lastly, with the full GenCord eQTL dataset with 190 samples, we confirmed one more gene PIP4K2A on chromosome 10 that reached statistical significance (Table 1), which has been reported earlier (13). The TWAS results stratifying by age of onset identified the same set of genetic loci as using all cases (Supplementary Material, Fig. S3). The three genes, i.e. FIGNL1, IKZF1 and PIP4K2A, had P-values > 0.5 in the TEsting Direct Effects (‘TEDE’) tests (Supplementary Material, Table S10), providing no evidence of horizontal pleiotropy for SNPs selected in either GenCord or GTEx and thus strengthening the validity of our TWAS results.
Prediction power of the known genes and the corresponding TWAS results (R2 is the variance explained in gene expression in stage 1 and P-values are significance level in stage 2). Two values are reported for R2 and P-value in GenCord, corresponding to 147 samples and 190 samples.
Genes . | Chr . | Expressed in lymphocytes from GenCord or GTEx? . | GTEx . | GenCord . | ||
---|---|---|---|---|---|---|
R2 . | P-value . | R2 . | P-value . | |||
FIGNL1 | 7 | Both | 0.28 | 2.3 × 10−8 | 0.15(0.04) | 5.4 × 10−15 (7.2 × 10−14) |
IKZF1 | 7 | Both | X | X | 0.04(0.04) | 1.3 × 10−7 (6.5 × 10−5) |
PIP4K2A | 10 | Both | X | X | X (0.04) | X (2.1 × 10−7) |
BAK1 | 6 | Both | 0.53 | 0.01 | 0.42(0.49) | 0.02 (0.03) |
LHPP | 10 | Both | X | X | 0.21(0.17) | 0.64 (0.48) |
CDKN2A | 9 | Both | X | X | X | X |
C14orf119 | 14 | Both | X | X | X | X |
SLC7A8 | 14 | Both | X | X | X | X |
CDKN2B | 9 | GTEx | X | X | X | X |
DDC | 7 | Neither | X | X | X | X |
TMEM26 | 10 | Neither | X | X | X | X |
Genes . | Chr . | Expressed in lymphocytes from GenCord or GTEx? . | GTEx . | GenCord . | ||
---|---|---|---|---|---|---|
R2 . | P-value . | R2 . | P-value . | |||
FIGNL1 | 7 | Both | 0.28 | 2.3 × 10−8 | 0.15(0.04) | 5.4 × 10−15 (7.2 × 10−14) |
IKZF1 | 7 | Both | X | X | 0.04(0.04) | 1.3 × 10−7 (6.5 × 10−5) |
PIP4K2A | 10 | Both | X | X | X (0.04) | X (2.1 × 10−7) |
BAK1 | 6 | Both | 0.53 | 0.01 | 0.42(0.49) | 0.02 (0.03) |
LHPP | 10 | Both | X | X | 0.21(0.17) | 0.64 (0.48) |
CDKN2A | 9 | Both | X | X | X | X |
C14orf119 | 14 | Both | X | X | X | X |
SLC7A8 | 14 | Both | X | X | X | X |
CDKN2B | 9 | GTEx | X | X | X | X |
DDC | 7 | Neither | X | X | X | X |
TMEM26 | 10 | Neither | X | X | X | X |
Prediction power of the known genes and the corresponding TWAS results (R2 is the variance explained in gene expression in stage 1 and P-values are significance level in stage 2). Two values are reported for R2 and P-value in GenCord, corresponding to 147 samples and 190 samples.
Genes . | Chr . | Expressed in lymphocytes from GenCord or GTEx? . | GTEx . | GenCord . | ||
---|---|---|---|---|---|---|
R2 . | P-value . | R2 . | P-value . | |||
FIGNL1 | 7 | Both | 0.28 | 2.3 × 10−8 | 0.15(0.04) | 5.4 × 10−15 (7.2 × 10−14) |
IKZF1 | 7 | Both | X | X | 0.04(0.04) | 1.3 × 10−7 (6.5 × 10−5) |
PIP4K2A | 10 | Both | X | X | X (0.04) | X (2.1 × 10−7) |
BAK1 | 6 | Both | 0.53 | 0.01 | 0.42(0.49) | 0.02 (0.03) |
LHPP | 10 | Both | X | X | 0.21(0.17) | 0.64 (0.48) |
CDKN2A | 9 | Both | X | X | X | X |
C14orf119 | 14 | Both | X | X | X | X |
SLC7A8 | 14 | Both | X | X | X | X |
CDKN2B | 9 | GTEx | X | X | X | X |
DDC | 7 | Neither | X | X | X | X |
TMEM26 | 10 | Neither | X | X | X | X |
Genes . | Chr . | Expressed in lymphocytes from GenCord or GTEx? . | GTEx . | GenCord . | ||
---|---|---|---|---|---|---|
R2 . | P-value . | R2 . | P-value . | |||
FIGNL1 | 7 | Both | 0.28 | 2.3 × 10−8 | 0.15(0.04) | 5.4 × 10−15 (7.2 × 10−14) |
IKZF1 | 7 | Both | X | X | 0.04(0.04) | 1.3 × 10−7 (6.5 × 10−5) |
PIP4K2A | 10 | Both | X | X | X (0.04) | X (2.1 × 10−7) |
BAK1 | 6 | Both | 0.53 | 0.01 | 0.42(0.49) | 0.02 (0.03) |
LHPP | 10 | Both | X | X | 0.21(0.17) | 0.64 (0.48) |
CDKN2A | 9 | Both | X | X | X | X |
C14orf119 | 14 | Both | X | X | X | X |
SLC7A8 | 14 | Both | X | X | X | X |
CDKN2B | 9 | GTEx | X | X | X | X |
DDC | 7 | Neither | X | X | X | X |
TMEM26 | 10 | Neither | X | X | X | X |
Putative genes on chromosome 1 discovered by TWAS (nSNP is the number of SNPs included for TWAS step 2)
Dataset of TWAS step 1 . | Gene . | TSS . | TES . | nSNP . | TWAS P-value . | TEDE P-value . |
---|---|---|---|---|---|---|
GenCord (149 samples) | ANKRD65 | 1 418 420 | 1 421 769 | 4 | 1.2 × 10−11 | <0.001 |
GenCord (190 samples) | PUSL1 | 1 308 567 | 1 311 584 | 5 | 1.1 × 10−21 | 0.15 |
MRPL20 | 1 402 047 | 1 407 313 | 2 | 8.3 × 10−36 | 0.30 | |
ANKRD65 | 1 418 420 | 1 421 769 | 5 | 5.5 × 10−16 | <0.001 | |
RP11-345P4.9 | 1 613 758 | 1 615 795 | 7 | 2.2 × 10−26 | 0.20 | |
GTEx (149 samples) | RP4-758 J18.13 | 1 409 096 | 1 410 618 | 7 | 5.2 × 10−8 | <0.001 |
SLC35E2B | 1 659 529 | 1 692 728 | 6 | 3.2 × 10−8 | <0.001 | |
RP1-283E3.4 | 1 724 512 | 1 737 251 | 8 | 4.3 × 10−14 | <0.001 |
Dataset of TWAS step 1 . | Gene . | TSS . | TES . | nSNP . | TWAS P-value . | TEDE P-value . |
---|---|---|---|---|---|---|
GenCord (149 samples) | ANKRD65 | 1 418 420 | 1 421 769 | 4 | 1.2 × 10−11 | <0.001 |
GenCord (190 samples) | PUSL1 | 1 308 567 | 1 311 584 | 5 | 1.1 × 10−21 | 0.15 |
MRPL20 | 1 402 047 | 1 407 313 | 2 | 8.3 × 10−36 | 0.30 | |
ANKRD65 | 1 418 420 | 1 421 769 | 5 | 5.5 × 10−16 | <0.001 | |
RP11-345P4.9 | 1 613 758 | 1 615 795 | 7 | 2.2 × 10−26 | 0.20 | |
GTEx (149 samples) | RP4-758 J18.13 | 1 409 096 | 1 410 618 | 7 | 5.2 × 10−8 | <0.001 |
SLC35E2B | 1 659 529 | 1 692 728 | 6 | 3.2 × 10−8 | <0.001 | |
RP1-283E3.4 | 1 724 512 | 1 737 251 | 8 | 4.3 × 10−14 | <0.001 |
Putative genes on chromosome 1 discovered by TWAS (nSNP is the number of SNPs included for TWAS step 2)
Dataset of TWAS step 1 . | Gene . | TSS . | TES . | nSNP . | TWAS P-value . | TEDE P-value . |
---|---|---|---|---|---|---|
GenCord (149 samples) | ANKRD65 | 1 418 420 | 1 421 769 | 4 | 1.2 × 10−11 | <0.001 |
GenCord (190 samples) | PUSL1 | 1 308 567 | 1 311 584 | 5 | 1.1 × 10−21 | 0.15 |
MRPL20 | 1 402 047 | 1 407 313 | 2 | 8.3 × 10−36 | 0.30 | |
ANKRD65 | 1 418 420 | 1 421 769 | 5 | 5.5 × 10−16 | <0.001 | |
RP11-345P4.9 | 1 613 758 | 1 615 795 | 7 | 2.2 × 10−26 | 0.20 | |
GTEx (149 samples) | RP4-758 J18.13 | 1 409 096 | 1 410 618 | 7 | 5.2 × 10−8 | <0.001 |
SLC35E2B | 1 659 529 | 1 692 728 | 6 | 3.2 × 10−8 | <0.001 | |
RP1-283E3.4 | 1 724 512 | 1 737 251 | 8 | 4.3 × 10−14 | <0.001 |
Dataset of TWAS step 1 . | Gene . | TSS . | TES . | nSNP . | TWAS P-value . | TEDE P-value . |
---|---|---|---|---|---|---|
GenCord (149 samples) | ANKRD65 | 1 418 420 | 1 421 769 | 4 | 1.2 × 10−11 | <0.001 |
GenCord (190 samples) | PUSL1 | 1 308 567 | 1 311 584 | 5 | 1.1 × 10−21 | 0.15 |
MRPL20 | 1 402 047 | 1 407 313 | 2 | 8.3 × 10−36 | 0.30 | |
ANKRD65 | 1 418 420 | 1 421 769 | 5 | 5.5 × 10−16 | <0.001 | |
RP11-345P4.9 | 1 613 758 | 1 615 795 | 7 | 2.2 × 10−26 | 0.20 | |
GTEx (149 samples) | RP4-758 J18.13 | 1 409 096 | 1 410 618 | 7 | 5.2 × 10−8 | <0.001 |
SLC35E2B | 1 659 529 | 1 692 728 | 6 | 3.2 × 10−8 | <0.001 | |
RP1-283E3.4 | 1 724 512 | 1 737 251 | 8 | 4.3 × 10−14 | <0.001 |
New putative locus on chromosome 1
Multiple genes located on chromosome 1p36 were significantly associated with ALL in the TWAS analysis (Table 2). Although a few genes were shown to have pleiotropic effects according to the TEDE test, three genes PUSL1, MRPL20 and RP11-345P4.9 had no evidence supporting the violation of the TWAS assumptions (P-values > 0.05). PUSL1 and MRPL20 were previously reported as gene targets for genetic alternation among benzene-exposed adult workers before symptoms of leukemia occurred (14). Another interesting gene in this region is ANKRD65. In TEDE test, ADKRD65 showed evidence of pleiotropy (P-value < 0.0001) and it was previously reported to be associated with having been breastfed as baby (15, 16), which was inversely associated with childhood leukemia (ref 17). The predicted gene expression levels of ADKRD65 were negatively correlated with those of PUSL1 (R = −0.74) and MRPL20 (R = −0.77). However, in the multivariate TWAS analysis adjusting for the predicted gene expression levels of PUSL1, MRPL20, and RP11-345P4.9, ANKRD6 was still marginally significantly associated with ALL (P-value = 0.003). More investigation is needed for this gene to solidify its relationship with ALL.
Discussion
To evaluate how developmental-stage appropriate eQTL data could help us better understand the etiology of childhood leukemia, we compared the eQTL mapping and TWAS analysis using B-cell lymphocytes from adults (GTEx) and newborns (GenCord), matching the studies on sample size, SNPs and analytic pipelines. Around 80% sharing of the eQTLs identified between newborn and adult lymphoblastoid cells was estimated, nonetheless, more than a thousand eQTLs exerted statistically different effects on the corresponding gene expression between the two study sets. In the TWAS analysis, we observed hundreds of genes whose expression levels could be well-predicted by eQTLs in one study but not the other. Two genetic loci on 7p12 and 1p36 were identified to be associated with B-cell ALL in TWAS by both GTEx and GenCord. At first glance, the two eQTL datasets identified the same regions; however, the FIGNL1 had much stronger signal in GenCord than GTEx, and the other signals were very different. These results suggested that using tissues from adults may be an acceptable choice for TWAS of childhood cancers given its availability, but the age-appropriate eQTLs were different and likely preferable and may play an important role in elucidating the germline architecture of childhood leukemia and other cancers.
The predicted gene expression of FIGNL1, IKZF1 and PIP4K2A in newborns were associated with B-cell ALL. As there were concerns about the violation of the TWAS assumptions, we confirmed the validity of our results by using the recently developed TEDE testing procedure. On the other hand, for the new genetic locus located on 1p36, the leading SNP rs1695824 was not reported in a GWAS study with overlapping cases with our study and in low linkage disequilibrium (LD) with the nearby genotyped/imputed SNPs (We reported the MAF and regional plot around this leading SNP in Supplementary Material, Table S11 and Supplementary Material, Fig. S4); however, the previous study used a different control dataset (10). Given this concern, we applied a sensitivity analysis using a different control dataset and still observed very strong association for rs1695824. We obtained the Hi-C visualization of this region (Supplementary Material, Fig. S5) in GM12878, the tissue with most enriched eQTLs, and found that long-range interaction is possible between the leading SNP and the genes PUSL1 and MRPL20, and RP11-345P4.9 (18). Only the interaction with the RP11-345P4.9 was slightly weaker than the other two genes. We present the association but suggest that it be considered provisional until there is confirmation in further investigation. There were some differences between our TWAS results and the ones reported by Vijayakrishnan et al. (10) who leveraged eQTL datasets from multiple tissues in the GTEx (Table 1). In our study, only models for BAK1 and FIGNL1 had good prediction power using GTEx eQTL data from EBV-transformed lymphocytes. It may be because SNPs with MAF < 0.05 and SNPs not shared across platforms were excluded. In fact, our enrichment analysis results of eQTLs in the functional annotations did suggest the usefulness of using multiple tissues in TWAS, consistent with other findings (19,20). However, we note that DDC and TMEM26 were not expressed in lymphocytes in either study and CDKN2B was not expressed in GenCord, suggesting there to be a modest risk of increasing false positives when using multiple tissue types in TWAS.
As for the limitations of this study, we only focused on the leading variant for each eGene in eQTL mapping to compare the difference between GenCord and GTEx given the limited sample sizes in both studies. Genes with multiple eQTLs were observed in larger eQTL studies (21). In addition, we included the MHC region for the eQTL mapping and TWAS analysis as HLA genes were previously reported to be associated with childhood leukemia (22). We found that eGenes with age-differential eQTLs were enriched in MHC-related pathways (Supplementary Material, Table S9), especially class I MHC. As noted, RNA-seq pipelines may underestimate expression for HLA genes, which mostly impact eQTL mapping for HLA class II (23). Nonetheless, the enrichment analysis should be interpreted with caution. Moreover, GTEx had 2374 more genes reaching the inclusion criteria than GenCord, which could be an artifact due to the platforms, or true differential gene expression at different developmental stages, but we cannot discern the true explanation from our study. We only compared the effect sizes of eQTLs corresponding to the expressed genes shared between GenCord and GTEx in eQTL mapping. Therefore, it is likely that the number of age-differential eQTLs was underestimated. In addition, the lymphocytes cells were EBV transformed for both GTEx and GenCord. Gene expression levels could be affected by the transformation, although research has shown that gene expression differences are negligible between primary and transformed lymphoblastic cells (24). Therefore, we expect the likelihood to be small that TWAS would miss genes of importance in B-cell ALL that may be observed in primary cells. Lastly, we only explored one fetal tissue and one major childhood disease. Although tissue-sharing in eQTLs was reported for adults (20), such sharing has not been fully accessed for children. Our work inevitably suffers from power limitation due to the current sparsity of the eQTL and GWAS data with developmental origins, and would serve as a starting point for such analysis and may not be generalizable to all tissues/diseases.
In a nutshell, we demonstrated the existence of age-differential eQTLs in B-cell lymphocytes. Additionally, we found the top gene FIGNL1 had a stronger effect on the risk of ALL in GenCord and a novel genetic locus on 1p36. Although the results of weaker signals were quite different between GenCord and GTEx in TWAS analysis, the same genetic loci were significantly associated with B-cell ALL. Although the differences in this instance were not dramatic, they pointed to the need for matched SNP array and gene expression data from more individuals and across multiple tissues at various developmental stages to apply to case–control data from diseases with early-life origins, such as pediatric cancers. We are aware of only one other comparison of eQTLs in fetal and adult tissues, namely the brain, which showed fetal-specific eQTL patterns but only the adult diseases were analyzed for eQTL enrichment (9). In recognition of the gap in the literature, our research comprehensively evaluated age-differential eQTLs and their effects on B-cell ALL, that has early-life origins, providing scientific grounds for tissue-banking projects such as the National Institutes of Health developmental genotype-tissue expression project.
Materials and Methods
GenCord study
GenCord (accession number: EGAS00001000446) is a collection of umbilical cord samples from 204 newborns of central European descent, from which immortalized lymphoblastoid cell lines were derived (25). After excluding outliers and correlated samples, 190 of them had the genotyping data on 2.5 million SNPs and RNA-seq data of 15 M–80 M reads derived from B-lymphoblastoid cells, which give rise to B-cell ALL. We imputed sex using the genotyping data, resulting in 101 males and 89 females. We set SNPs with the genotyping probability <0.85 as missing and excluded SNPs if missingness >5%, the P-value for departure from Hardy–Weinberg disequilibrium <10−7, MAF < 0.05 based on all samples and each sex, multiallelic and on sex chromosomes. Standard quality control (QC) was performed on RNA-seq data following the TOPMed RNA-Seq pipeline and read counts were processed by the trimmed mean of M-values normalization method. Genes were selected based on two expression thresholds: (i) > =0.1 transcripts per kilobase million in > = 20% samples (ii) > =6 reads in > = 20% samples, following the recommendations of the GTEx consortium (www.gtexportal.org/home/documentationPage). A total of 19 628 genes remained for analysis.
GTEx study
The GTEx project aims to characterize genetic effects on the transcriptome across adult human tissues (19, 20). We obtained the individual-level whole genome sequencing data through dbGap (accession number: phs00424.v8) and normalized gene expression data in EBV-transformed B-cell lymphocytes and covariates of 147 samples (94 females, 53 males) through GTEx portal. The QC steps have been described elsewhere (ref. 20). We restricted our analysis to SNPs passing QC and having MAF > 0.05 and genes passing expression thresholds, resulting in 22 002 genes for analysis.
B-cell ALL case–control study
The B-cell ALL cases data arise from four studies including the Children’s Oncology Group (GOG) studies (AALL0232, P9904, P9906) and Total Therapy XIIIB/XV from St. Jude Children’s Research Hospital (10) and controls arise from the Atherosclerosis Risk in Communities Study (ARIC, (26)). COG (phs000638.v1.p1 and phs000637.v1.p1) and ARIC (phs000090.v1.p1) data were obtained from dbGap. Ancestries were imputed using the RFMix package (27) and only subjects with >75% European ancestry (EA) and <10% Admixed Americans were included in the analysis. For all the germline datasets, imputation was performed using the TOPMed reference panel and QC steps were performed in each study that included removing samples with >10% missing data among all variants with missing <20%, removing variants that showed an association between missingness and sex (P-value < 10−7), removing variants that were indels, multiallelic, rare (MAF < 0.01) or poorly imputed (R2 < 0.5). The five datasets were then combined, and variants were filtered for missingness (<5%), divergence from Hardy–Weinberg disequilibrium (P-value < 10−7), and missingness by case associations (P-value < 10−6). Additionally, samples with high IBD (>0.2), outlying genotypes and ambiguous sex were excluded, leaving 1592 cases (849 males, 743 females) and 9736 ARIC controls (4587 males, 5149 females) in total across the studies.
For sensitivity analysis, we used the Giesinger control set (phs000957.v1.p1) along with the 1592 cases to validate hits observed from above. The Giesinger control genetic data were imputed using TOPMed reference panel and followed the same post-imputation steps as described above. A total of 9212 healthy controls (3629 males and 5583 females) were included in the study with >99% self-reported EA.
Statistical analysis
We performed eQTL mapping and TWAS analysis using GenCord and GTEx separately (Fig. 2). As we were interested in comparing GenCord and GTEx with regard to their potential in boosting the statistical power in identifying genetic risk factors, we limited the analysis to the same set of common genetic variants and the same number of males and females in both datasets. Specifically, 94 females and 53 males were randomly selected from GenCord to match the sex distribution with GTEx, leading to 147 samples in both studies. Moreover, the eQTL mapping was analyzed on the shared SNPs between GenCord and GTEx, and the TWAS was performed on the shared SNPs among GenCord, GTEx and the ALL case–control study. We also worked with the full GenCord data (190 samples) to find genetic signals associated with B-cell ALL, in which case the shared SNPs between GenCord and the ALL case–control study were included for the TWAS analysis.

Summary flow diagram for the analysis. (Image is made with BioRender - biorender.com)

Miami plot of TWAS results applying to ALL data using prediction models trained in GenCord and GTEx eQTL data (147 individuals).
eQTL mapping
We identified associations between the expression levels and genetic variants located within 100 kb of the target gene’s TSS, i.e. cis-eQTLs (ref. 28). Normalized gene expression values were regressed on the additive genotypes through a linear regression model with correction for covariates, including the top principal components (PCs), sex, known batch effects and hidden confounders estimated through probabilistic estimation of expression residuals (PEER). Specifically, 10 PEER factors and 3 PCs were adjusted in the analysis of GenCord data, while PCs were derived from LD-pruned common variants (R2 > 0.02, window size 250 kb with a step of 5 kb) and PEER factors were calculated using the ‘PEER’ R package (29). Fifteen PEER factors, sequencing protocol (polymerase chain reaction-based or -free), sequencing platform (HiSeq 2000 or HiSeq X), and 5 PCs were obtained from the GTEx portal and adjusted for analysis of GTEx data.
The FDR were calculated by first correcting for the number of SNPs per gene through estimating the null beta distribution using 1000–10 000 permutations and then corrected for testing the number of genes by using Storey’s q-value method (9). Significant eQTLs were declared at FDR < 0.05 and the corresponding genes were defined as ‘eGenes’. For eQTL mapping, we followed the eQTL discovery pipeline from the GTEx Consortium (https://github.com/broadinstitute/gtex-pipeline/tree/master/qtl), in which the key software adopted for the analysis is FastQTL (20). The same eQTL pipeline was utilized in GenCord and GTEx separately. We assessed eQTL sharing in GenCord and GTEx by the π1 statistic, the proportion of non-null associations, through an R package ‘qval’ (30). Furthermore, we tested the difference in allele effect sizes of eQTLs between the two studies by dividing the difference of beta coefficients by the standard error (two-sample z-score test) and significance was declared at FDR < 0.05.
Enrichment Analysis
Enrichment of eQTLs in functional annotation categories
We used GWAS analysis of regulatory or functional information enrichment with LD correction (GARFIELD) to analyze the enrichment pattern of the eQTLs identified in GTEx and GenCord, accounting for the distance to the nearest TSS, MAF and LD structure (31). The minimum nominal P-value for each gene in the eQTL analysis was used as the P-value for a single ‘SNP’ in the GWAS study (9). We estimated the enrichment odds ratios and P-values for each annotation based on regulatory maps from the ENCODE and Roadmap Epigenomics projects (1004 tests in total, 166 blood traits). The analysis was conducted using the GARFIELD package v2 available at https://www.ebi.ac.uk/birney-srv/GARFIELD/.
Enrichment of eGenes with different eQTL patterns in biological pathways
We evaluated the pathway enrichment of protein-coding eGenes that harbor an eQTL with age-differential effect sizes through Reactome, a manually curated database of reactions, pathways, and biological processes (32). Binomial tests were performed on the list of eGenes through the online Reactome analysis tool (v74), which contains 52.6% of the 20 296 predicted human protein-coding genes. Statistical significance was declared at FDR < 0.05.
Enrichment of eQTLs associated with ALL
We performed Fisher’s exact tests to evaluate whether eQTLs were more likely to be associated with ALL. The eQTLs were obtained from using 190 GenCord samples and were compared with the non-eQTLs shared between GenCord and ALL case–control studies. We examined the enrichment at three GWAS P-value thresholds (5 × 10−2, 5 × 10−3 and 5 × 10−4).
TWAS analysis
TWAS is a type of two-stage least squares regression analysis utilizing genetic variants as instrumental variables to identify causal genes (5, 7). In the first stage, the genetically regulated gene expression was calculated by building elastic net models in the GTEx and GenCord datasets for each gene separately. SNPs located within 500 kb before and after the TSS were included as predictors (5). Models were fitted using the FUSION software (5), and those with good prediction power (cross-validated correlation between predicted and true expression >0.1 and corresponding P-value <0.05) were retained. In the second stage, we tested the relationship between the genetically regulated gene expression and B-cell ALL using logistic regression, with the first 20 genetic PCs included as covariates. The family-wise Type I error rate was controlled at 0.05 by Bonferroni’s correction. We also performed a sensitivity analysis using all GenCord data by using shared SNPs between GenCord and ALL data. We analyzed the association using all cases as well as a subset of cases stratified by age at diagnosis as <=10 years or >10 years. The same ARIC controls were used for all models.
Checking pleiotropy assumption
The key assumptions for TWAS are likely to be violated in real-life data. The most concerning one is related to horizontal pleiotropy of genetic variants, which may introduce misleading results (6). Therefore, we applied the TEDE method to examine the potential violation of the TWAS assumptions for the genes that reached the genome-wide significance level. TEDE was originally developed for summary-level statistics (33), but as the individual-level data were available, we directly estimated the covariance matrix of SNPs from the B-cell ALL case–control study. Furthermore, for SNPs predictive of gene expression levels, we estimated the effect sizes and their standard errors through fitting joint linear regression models. We reported the P-values of the TEDE-aSPU2 tests that incorporated the estimation variability of the first stage of TWAS.
Acknowledgements
The authors acknowledge Dr. Patrick Monnahan for his contribution in preparing the datasets; the authors acknowledge the Minnesota Supercomputing Institute (MSI) at the University of Minnesota for providing resources that contributed to the research results reported within this paper.
Conflicts of Interest statement. None declared.
Funding
The study was supported by Children’s Cancer Research Fund, Minneapolis, Minnesota (to Dr Yang).
Abbreviations
- expression quantitative trait loci
(eQTL)
Gene and Tissue Expression
(GTEx)
genome-wide association studies
(GWAS)
single nucleotide polymorphisms
(SNPs)
transcriptome-wide association studies
(TWAS)
acute lymphoblastic leukemia
(ALL)
minor allele frequency
(MAF)
quality control
(QC)
Children’s Oncology Group
(GOG)
Atherosclerosis Risk in Communities Study
(ARIC)
European ancestry
(EA)
transcription start site
(TSS)
principal components
(PCs)
probabilistic estimation of expression residuals
(PEER)
linkage disequilibrium
(LD)
false discovery rates
(FDR)
GWAS analysis of regulatory or functional information enrichment with LD correction
(GARFIELD)
TEsting Direct Effects
(TEDE)
References
Li, K., Jing, Y., Yang, C., Liu, S., Zhao, Y., He, X., Li, F., Han, J. and Li, G. (
Staley, J.R., Blackshaw, J., Kamat, M.A., Ellis, S., Surendran, P., Sun, B.B., Paul, D.S., Freitag, D., Burgess, S., Danesh, J. and Young, R. et al. (
Kamat, M.A., Blackshaw, J.A., Young, R., Surendran, P., Burgess, S., Danesh, J., Butterworth, A.S. and Staley, J.R. (
Kwan, M.L., Buffler, P.A., Abrams, B. and Kiley, V.A. (
Rao, S.S.P.,Huntley, M.H.,Durand,N.C., Stamenova, E.K., Bochkov, I.D., Robinson, J.T., Sanborn, A.L., Machol, I., Omer, A.D., Lander, E.S. et al. (
Aguet, F., Barbeira, A.N., Bonazzola, R., Brown, A., Castel, S.E., Jo, B., Kasela, S., Kim-Hellmuth, S., Liang, Y.Y., Oliva, M. et al. (
Shang, L., Smith, J.A., Zhao, W., Kho, M., Turner, S.T., Mosley, T.H. and Kardia sl, Zhou X. (
Urayama, K.Y., Thompson, P.D., Taylor, G.M., Trachtenberg, E.A. and Chokkalingam, A.P. (
Aguiar, V.R., César, J., Delaneau, O., Dermitzakis, E.T. and Meyer, D. (
Çali skan, M., Cusanovich, D.A., Ober, C. and Gilad, Y. (
Gutierrez-Arcelus,M., Lappalainen, T.,Montgomery, S.B., Buil, A., Ongen,H., Yurovsky, A., Bryois, J., Giger, T., Romano, L., Planchon, A. et al. . (
The ARIC investiagtors (
Maples, B.K., Gravel, S., Kenny, E.E. and Bustamante, C.D. (
Peters, J.E., Lyons, P.A., Lee, J.C., Richard, A.C., Fortune, M.D., Newcombe, P.J., Richardson, S. and Smith, K.G. (
Stegle, O., Parts, L., Piipari, M., Winn, J. and Durbin, R. (
Storey, J.D. and Tibshirani, R. (
Iotchkova, V., Ritchie, G.R.S., Geihs, M., Morganella, S., Min, J.L., Walter, K., Timpson, N.J., Dunham, I., Birney, E. and Soranzo, N. (
Fabregat, A., Sidiropoulos, K., Viteri, G., Forner, O., Marin-Garcia, P., Arnau, V., D' Eustachio, P., Stein, L. and Hermjakob, H. (