Assessing the role of genome-wide DNA methylation between smoking and risk of lung cancer using repeated measurements: the HUNT study

Abstract Background It is unclear if smoking-related DNA methylation represents a causal pathway between smoking and risk of lung cancer. We sought to identify novel smoking-related DNA methylation sites in blood, with repeated measurements, and to appraise the putative role of DNA methylation in the pathway between smoking and lung cancer development. Methods We derived a nested case-control study from the Trøndelag Health Study (HUNT), including 140 incident patients who developed lung cancer during 2009–13 and 140 controls. We profiled 850 K DNA methylation sites (Illumina Infinium EPIC array) in DNA extracted from blood that was collected in HUNT2 (1995–97) and HUNT3 (2006–08) for the same individuals. Epigenome-wide association studies (EWAS) were performed for a detailed smoking phenotype and for lung cancer. Two-step Mendelian randomization (MR) analyses were performed to assess the potential causal effect of smoking on DNA methylation as well as of DNA methylation (13 sites as putative mediators) on risk of lung cancer. Results The EWAS for smoking in HUNT2 identified associations at 76 DNA methylation sites (P < 5 × 10–8), including 16 novel sites. Smoking was associated with DNA hypomethylation in a dose-response relationship among 83% of the 76 sites, which was confirmed by analyses using repeated measurements from blood that was collected at 11 years apart for the same individuals. Two-step MR analyses showed evidence for a causal effect of smoking on DNA methylation but no evidence for a causal link between DNA methylation and the risk of lung cancer. Conclusions DNA methylation modifications in blood did not seem to represent a causal pathway linking smoking and the lung cancer risk.


Quality control (QC) and functional normalization of the DNA methylation data
The quality control (QC) processing step using raw data files (IDAT) included background correction, dye bias correction, sex prediction and cell count estimation. Background and dye bias correction were performed to enhance the signal-to-noise-ratio of the methylation array.
Predicted sex was used to detect sample mix-ups. The cell count estimation (B cell, CD4 + cells, CD8 + cells, eosinophil, monocytes, neutrophils and natural killer cells) was obtained using the Houseman algorithm 1 . Among the 560 samples, we removed 6 sex detection outliers, 1 sex detection mismatch and 11 outliers from the methylated and unmethylated signal comparison. The 11 outliers were samples with predicted median methylated signal being more than 3 standard deviations from the expected signal. After removal of these 18 samples, the QC plot depicting the median intensity methylated vs. unmethylated signal shows a random distribution of the samples (Supplementary Figure S1). Functional normalization was undertaken on the remaining samples to remove technical batch effects.
The functional normalization algorithm developed by Fortin et al. 2 used the internal control probes present on the array to infer between-array technical variation. Principal components were applied to remove batch effects further. The QC and normalization of DNA methylation data were performed with R package meffil (version 1.1.0) 3 .

Statistical analysis
All statistical analyses were performed with R (version 3.6.1) or Stata/SE 15.1 (StataCorp, College Station, TX). Different sets of data that were used for specific statistical analyses are described in Supplementary Table S1.

EWAS for smoking in controls (cross-sectional analysis)
We first performed an EWAS for the smoking phenotype (the 7 levels) in blood samples collected from the controls in HUNT2. Linear regressions were performed with DNA methylation beta-values as the outcome and smoking phenotype as the exposure. Covariates were included in regression models to adjust for effects of sex, age, and estimated cell counts.
Surrogate variable analysis (SVA) 4 was used to generate 12 variables which were included as covariates in EWAS models in order to adjust for batch and other technical artifacts. The P value cut-off was set at epigenome-wide significance level (5×10 -8 ). EWAS for smoking was performed with R package meffil (version 1.1.0) 3 . Genomic regions were annotated using Human MethylationEPIC probe annotations (the positions are based on human genome reference build b37). The following annotations were used as locations in relation to gene region: TSS1500 (200 to 1500 nucleotides upstream of transcription start site); TSS200 (up to 200 nucleotides upstream of transcription start site); 5' UTR (5' untranslated region); 1st Exon; Body (gene body); and 3' UTR (3' untranslated region).

Confirmation of EWAS for smoking
To confirm the associations identified form the EWAS for smoking, we performed an analysis using repeatedly measured DNA methylation data from both the HUNT2 and the HUNT3 samples in relation to the smoking phenotype in HUNT2 among the controls. We used a less computationally intensive strategy by performing linear regression across the repeated measures with cluster-robust standard errors (LMRSE) 5 . Sex, age and estimated cell counts in HUNT2 were included as covariates in the analysis and Bonferroni correction was applied for the epigenome-wide significant DNA methylation sites identified from the EWAS. In order to confirm the results from the LMRSE, computationally intensive linear mixed effects model (LMEM) with random intercept was also performed for randomly selected 1000 DNA methylation sites across the entire genome using the repeatedly measured DNA methylation data. The analyses of LMRSE and LMEM were performed with R packages lmrse (version 2.0), lme4 (version 1.1.21) and omics (version 0.1.5), respectively.

Effect of change in smoking on change in DNA methylation in controls
We also explored the possible effect of change in smoking status between HUNT2 and HUNT3 (categorized as decrease, no change or increase) on change in DNA methylation (beta-value of DNA methylation in HUNT3 minus beta-value of DNA methylation in HUNT2) among the controls. Sex, age and cell count estimates (both HUNT2 and HUNT3) were included as covariates in the model. Bonferroni correction was applied for the epigenome-wide significant DNA methylation sites identified from the EWAS for smoking.
The association between changes in DNA methylation and changes in smoking status was tested using linear regression models using R package meffil (version 1.1.0) 3 .

EWAS for lung cancer (case-control analysis)
EWAS for lung cancer was performed among the lung cancer cases vs. controls with DNA methylation measured in HUNT2 and HUNT3 respectively as the exposure. Covariates included smoking phenotype, sex, age and cell count estimates and the epigenome-wide Pvalue cut-off was set at 5×10 -8 . Logistic regression analyses were performed in R package ewaff (version 0.0.1) (https://github.com/perishky/ewaff).

Mediation analysis of DNA methylation on the pathway between smoking and risk of lung cancer
The smoking-related DNA methylation sites that overlapped between the EWAS for smoking and the EWAS for lung cancer in the HUNT2 samples were individually evaluated as potential mediators between the smoking phenotype and lung cancer using mediation analysis. Sex, age and cell count estimates were included as covariates in the model. A counterfactual framework was applied, and the mediation analysis was performed with R package mma (version 9.0.0) 6 . Multiple mediators were then considered simultaneously, and a weighted methylation score was calculated as the sum of methylation beta-value at each mediator site weighted by its effect size on lung cancer. The indirect effect carried by the individual mediator or by the weighted methylation score was separated from the total effect of smoking phenotype.

Two-step Mendelian randomization analyses
MR may be used to infer causal effect of the exposure if genetic variants used as instrumental variables satisfy the following three fundamental assumptions: 1) strongly associated with the exposure; 2) independent of confounding factors of the observational association; and 3) associated with the outcome only through the exposure (no horizontal pleiotropy) 7,8 .
Potential causal associations were tested by applying MR two times, hence it is called twostep MR. A first step MR analysis was applied to evaluate the causal effect of smoking on DNA methylation. We used a smoking genetic score including 3 SNPs as an instrumental variable for the smoking phenotype (7 levels): rs6265 (BDNF) associated with smoking initiation, rs1051730 (CHRNA3) with smoking quantity, and rs3025343 (DBH) with smoking cessation 9 . The genetic score was calculated by summing up the risk alleles (associated with starting smoking, smoking more, and not quitting smoking) of these 3 SNPs for each individual. One-sample MR using the two-stage least square (2SLS) method was applied to investigate a causal relationship between smoking and DNA methylation at the sites identified in the EWAS for smoking. Sex, age and predicted cell counts were included in the MR models. Pearson correlation coefficient was used to evaluate the relationship between estimates derived from the MR and EWAS analyses.
A second step MR was performed to evaluate the putative causal association between DNA methylation and the risk of lung cancer. We applied a two-sample MR in order to leverage power from large genome-wide association studies (GWAS). Instruments for the DNA methylation sites detected as putative mediators with the mediation analyses were extracted from an mQTL GWAS in Generation Scotland. Generation Scotland has been described elsewhere 10,11 . In brief, a subset of participants (n=5 101) from the Generation