Frequency of heavy vehicle traffic and association with DNA methylation at age 18 years in a subset of the Isle of Wight birth cohort

Abstract Assessment of changes in DNA methylation (DNA-m) has the potential to identify adverse environmental exposures. We examined DNA-m among a subset of participants (n = 369) in the Isle of Wight birth cohort who reported variable frequencies of traffic near their residences. We used self-reported frequencies of heavy vehicles passing by the homes of study subjects as a proxy measure for traffic-related air pollution (TRAP); the response categories were: never, seldom, 10 per day, 1–9 per hour and >10 per hour. Methylation of cytosine-phosphate-guanine (CpG) dinucleotide sequences in the DNA was assessed from blood samples collected at age 18 years (n = 369) in the F1 generation. We conducted an epigenome-wide association study to examine CpGs related to the frequency of heavy vehicles passing by subjects’ homes, and employed multiple linear regression models to assess potential associations. We repeated some of these analyses in the F2 generation (n = 140). Thirty-five CpG sites were associated with heavy vehicular traffic. After adjusting for confounders, we found 23 CpGs that were more methylated, and 11 CpGs that were less methylated with increasing heavy vehicular traffic frequency among all subjects. In the F2 generation, 2 of 31 CpGs were associated with traffic frequencies and the direction of the effect was the same as in the F1 subset, while differential methylation of 7 of 31 CpG sites correlated with gene expression. Our findings reveal differences in DNA-m in participants who reported higher heavy vehicular traffic frequencies when compared to participants who reported lower frequencies.


Introduction
Evidence for the health impacts of air pollution has been mounting for several decades [1][2][3]. Exposure to ambient air pollutants is associated with both acute and chronic health effects and the impacts are felt on global and local scales [4]. Interestingly, the observed adverse health effects are seen even at very low levels of air pollution exposure, and it is unclear whether any threshold exists (i.e. a concentration below which there are no effects on health) [5]. The concentration of air pollutants can differ in a small geographic area depending on local ambient conditions [6]. Key environmental factors that significantly affect local air quality include proximity to traffic, wood burning, coal burning, dry cleaning, motor vehicle exhaust and industrial emissions, among others [7][8][9][10][11][12][13][14]. Exposures to such environmental factors are associated with asthma exacerbation [15], although their contribution to the development of the disease is uncertain [16].
For an environmental factor such as traffic, it is often necessary to investigate simple proxies, such as distance to roadways and traffic estimates or counts, to help assign individual exposures and account for spatial variability. For instance, there is increasing evidence that living near heavy traffic is associated with increased rates of asthma, cardiovascular disease and dementia [17][18][19][20][21], and chronic air pollution exposure gradients at such small scales are associated with adverse cardiorespiratory effects [22]. In the absence of neighborhood-level air pollution measurements, proxies such as proximity to traffic and traffic volume, among other methods, can be employed [6,20][23][24][25][26][27][28][29][30]. Such substitutes facilitate the characterization of smaller-scale air pollution exposures, and have been used in some health studies [31][32][33].
Recent evidence indicates that epigenetics may play an important role in mediating the health effects of air pollution [34]. Indeed, it has been suggested that the extent of epigenetic markers can change progressively and help construct cumulative exposure patterns over time [35]. Interestingly, changes in epigenetic markers can result from exposure to a risk factor such as air pollution, and such changes can potentially serve as predictive biomarkers of susceptibility to adverse health [36]. The epigenetic marker of DNA methylation (DNA-m), which is the addition of a methyl group to cytosine in cytosine-phosphate-guanine (CpG) dinucleotide sequences in the DNA, is reported to be related to air pollution exposures [37,38], and adverse respiratory health [39], including asthma [40,41].
Changes in the epigenome and gene expression may be induced by exposure to air pollution [34,42] and this is relevant to the development of several pathophysiological processes. Differential blood DNA-m in response to air pollution exposure from sources such as traffic has been reported [43][44][45]. DNA-m can rarely, if ever, be assessed directly in target tissues such as the lung. However, for many biomarkers, blood changes are considered to constitute a window through which specific processes in other tissues can be assessed. In addition, during development, blood and airways both stem from the mesoderm and may therefore share a similar development and susceptibility [46]. For these reasons, the effects of traffic-related air pollution (TRAP) on the epigenome in blood samples can serve as informative biomarkers of change in the airways.
Given that (i) TRAP exerts its greatest impact on local scales, particularly near roadways [47] and (ii) the mechanistic basis for the effects of TRAP on the epigenome is not well delineated [48], additional studies can provide further evidence and advance the current state of the science [49]. Accordingly, we used the self-reported frequencies of heavy vehicles passing by the homes of study subjects as a proxy measure for TRAP and evaluated their associations with the methylation of CpG sites among 18-year-old participants in the Isle of Wight (IoW) birth cohort, UK (n = 369). Our motivating questions were: 1. Which specific CpG sites are associated with heavy vehicular traffic in the birth cohort? 2. Are there any trends in the association between differential DNA-m (both higher and lower) and the frequency of exposure to heavy vehicular traffic?

Characteristics of Study Population
Eighteen percent of the subjects (n = 67) reported never having any heavy vehicles passing by their homes while 82% reported some heavy vehicular traffic outside their homes (Table 1). About 20% had a history of maternal smoking and nearly 50% were exposed to tobacco smoke outside their homes and before age 4 (Table 1). About a quarter of the subjects were current smokers who started smoking at an average age of 14.5 (SD 1.5) years. The majority of the subjects were of middle-class status (72%), with over 90% still living at home with their parents and 70% living in a private residential property. The average body mass index (BMI) was 23.6 (SD 4.3). In this subset with DNA-m, there were more females than males (66% vs. 34%) due to the study design (following until pregnancy) (Table 1).

Which Specific CpG Sites Are Associated with Heavy Vehicular Traffic in the IoW Cohort?
There were a total of 371 CpG sites that were associated with heavy vehicular traffic frequency based on ttScreening results. However, a cutoff percentage of 70 [m = 70 across 100 total iterations (i = 100)] was used to determine the final pool of potentially important CpG sites (in our case, 35 of 371 had a cutoff percentage between 70 and 94). Therefore, a final group of 35 CpGs was selected in step 1 (Tables 2 and 3). The 35 CpG sites are listed in the order of significance based on the epigenome-wide association analysis results. Over 30% of these CpGs were located on Chromosome 1. The identified CpG sites were associated with 34 different genes (two CpG sites, cg11156891 and cg12407057, mapped to one gene, ANKRD65). A majority of the CpG sites were located in the body of the identified gene (24 of 35); 4 were 200-1500 bases upstream of the transcriptional start site (TSS), while 2 were within 0-200 bases of the TSS; 3 were within the 5′ untranslated region, and 2 were over 50 kb from the nearest gene (Table 3).
We also assessed answers to other traffic-related questions such as 'How often do cars pass your house or on the street less than 100 meters away?' and 'How frequently are you annoyed by outdoor air pollution (from traffic industry, etc) in your home if you keep the window open?'. However, these responses showed little variability and did not yield any significant results with the ttScreening package. The CpG-by-CpG analysis also did not show any statistically significant results for Any versus Never reports of heavy vehicular traffic frequency after adjusting for false discovery rate (FDR; all adjusted P-values were ≥0.4). There was no association between heavy vehicular traffic frequency and cg05575921, located in the aryl hydrocarbon receptor repressor (AHRR) gene. However, methylation at this site appeared to be associated with self-reported smoking status (among current smokers), tobacco smoke exposure assessed through a questionnaire administered at 10 years and environmental tobacco smoke exposure (Table 4).

Gene Set Enrichment Analysis
Using the bioinformatic resource ToppGene Suite [50], we performed a gene enrichment analysis to determine the pathway(s) associated with genes of the significant CpG sites [the respective genes that had the exact CpG coordinates, or if the CpG was located between two genes (i.e. intergenic CpGs), we selected the gene with the closest proximity to the intergenic CpG]. Input parameters for the gene enrichment analysis were as follows: All 34 genes were included in the training set, the hypergeometric probability mass function was used to calculate P-values, and the FDR was controlled at 0.05 using the Benjamini-Hochberg method.
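The two statistical ingredients named above, a hypergeometric P-value and Benjamini-Hochberg FDR control, can be illustrated with a minimal sketch. This is not the ToppGene implementation: the function names are hypothetical, the upper-tail form of the hypergeometric test is an assumption (it is the conventional choice for over-representation), and any gene counts passed in are for illustration only.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn without replacement from a
    background of N genes, K of which carry the annotation."""
    total = comb(N, n)
    upper = min(K, n)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible configurations contribute nothing to the sum.
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical numbers: 34 input genes, 5 of which fall in a
    # 200-gene pathway, against a 20 000-gene background.
    p = hypergeom_upper_tail(5, 200, 34, 20000)
    print(p, benjamini_hochberg([p, 0.20, 0.75]))
```

An annotation would be retained when its adjusted P-value is at or below the 0.05 FDR level stated in the text.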

Association with Air Pollutants in the Comparative Toxicogenomics Database
The analysis did not reveal any links to air pollutants, as there is currently not enough data on air pollution to factor into biological pathway analyses. However, a search of the description and page index of each gene provided information on reported chemicals related to air pollution in the comparative toxicogenomics database [51]. All but three genes were associated with chemical(s) found in air pollution, e.g. 'benzo(a)pyrene', '7,8-dihydro-7,8-dihydroxybenzo(a)pyrene 9,10-oxide', 'smoke' and even 'particulate matter' (Table 2).

Diseases for Which the Identified Genes Are Enriched
The gene enrichment analysis also identified diseases associated with six of the genes identified with the significant CpG sites in this study.
An initial ANOVA revealed a total of 24 of 35 CpGs with significantly different DNA-m (P < 0.05) when those who reported no heavy vehicles (never) were compared with those who reported any heavy vehicular traffic (Table 5). Further evaluations (using never, seldom, 10 per day, 1-9 per hour or >10 per hour levels) showed that all 35 CpGs had significantly different DNA-m for at least one of the five categories of heavy vehicle traffic (P ≤ 0.01; Table 5).
After adjusting for history of maternal smoking, environmental tobacco smoke exposure (0-4 years and/or at 10 years), SES, gender, BMI, current smoking status and/or exposure to smoke outside the home, 34 CpGs remained statistically significant depending on the category of heavy vehicle traffic frequency reported (P ≤ 0.05, range for n = 329-369) (Table 6). We also present results for linear models for the top 35 CpG sites identified with the ttScreening method after adjusting for all confounding factors considered a priori in this study, and the results are similar to Table 6, where associations are still present in 34 of 35 CpG sites for at least one category of the exposure variable (Supplementary Table S1).
In particular, we found 23 CpGs that were more methylated with increasing heavy vehicle traffic frequency (Fig. 2). Nineteen of these 23 CpG sites are found in the body of the associated genes while the remaining 4 are located in promoter regions (TSS1500 and TSS200) (Table 3). Conversely, of the 11 CpGs that were less methylated with increasing heavy vehicle traffic frequency (Fig. 2), 5 are located in the body of the gene, an additional 5 are found in promoter regions and the last one is ~50 kb upstream of TMEM161B (Table 3). Among subjects reporting the two highest heavy vehicle traffic frequencies (1-9 per hour or >10 per hour), statistical significance was consistently reached for the differential methylation observed at these CpG sites (P ≤ 0.05, Table 6). Stratification by current smoking status revealed similar trends among smokers and nonsmokers, although statistically significant differences were only detected for 12 CpGs among smokers and 26 CpGs among nonsmokers, mainly for those reporting >10 heavy vehicles per hour (Supplementary Tables S2 and S3, respectively). Regression results for males revealed only 10 statistically significant CpG sites with differential methylation: 7 were more methylated and 3 were less methylated (Supplementary Table S4). Results for females indicated 31 significant CpGs, with 22 more methylated and 9 less methylated (Supplementary Table S5). The direction of methylation remained the same, and the smaller number of significant CpG sites among male subjects is probably due to their smaller sample size in this birth cohort (n = 124).

Results of Replication and Gene Expression Analysis
We sought to replicate the findings for 31 of the 35 identified CpG sites in a smaller sample of 140 newborns in the F2 generation. Two CpG sites, cg25895913 (LGI2) and cg00347824 (NSMAF), were associated with traffic frequencies, and the direction of the effect was the same as in the F1 subset. The former CpG site had less methylation, while the latter had more methylation, with increasing vehicular traffic frequency (Table 7). Spearman rank correlation analysis then revealed seven CpG sites, cg24843003 (DAZAP1), cg03476673 (CRISPLD2), cg12417992 (SLC6A9), cg04154465 (WNT2B), cg24361098 (BCL11A), cg16668397 (JPH3) and cg17053854 (SEPT9), whose differential methylation was significantly correlated with gene expression (Table 8, partial r ≤ 0.27, P-value ≤ 0.05). For an additional two of these CpG sites, cg14162906 (TMEM222) and cg17053854 (SEPT9), there were marginal correlations with expression data from their associated genes (Table 8, 0.05 < P-value ≤ 0.06).
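For readers unfamiliar with the rank correlation used here, the following minimal sketch computes a plain Spearman coefficient between methylation and expression values. It is an illustration under stated assumptions, not the analysis code of this study: the function names are hypothetical, and it omits the covariate adjustment implied by the partial correlations reported in Table 8.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because only ranks enter the calculation, the coefficient is insensitive to the skewed distributions that are common in expression data.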

Discussion
We aimed to answer two questions: (i) Which specific CpG sites are associated with heavy vehicular traffic in the birth cohort? (ii) Are there any trends in the association between differential DNA-m and the frequency of heavy vehicular traffic? Regarding the first question, we found 35 CpG sites to be associated with heavy vehicular traffic. These CpG sites were associated with 34 different genes (two CpG sites, cg11156891 and cg12407057, mapped to the same gene: ANKRD65). Additionally, 31 of these genes have been reported to be associated with air pollution-related chemicals such as benzo(a)pyrene in the comparative toxicogenomics database. In adopting an epigenome-wide approach, as opposed to a candidate gene approach, our analysis adds novel information on epigenetic markers for traffic-related air pollution exposure. These exposure-associated changes in the epigenome could be used to identify exposure to air pollutants, particularly those from incomplete combustion of fuels such as diesel, which is often used in buses and trucks. With further research, they could also guide the development of effective clinical and public health interventions and reduce the burden of air pollution-related health outcomes.
For the second question on assessing the association between differential DNA-m and traffic-related air pollution, we found 23 CpGs that were more methylated, and 11 CpGs that were less methylated with increasing heavy vehicular traffic frequency for all subjects after adjusting for confounders. These associations between heavy vehicular traffic frequency and DNA-m measurements persisted after stratification by current smoking status for 26 and 12 CpG sites among nonsmokers and smokers, respectively. Among subjects reporting the two highest heavy vehicular frequency levels (1-9 per hour or >10 per hour), statistical significance was consistently reached for the differential methylation observed at these CpG sites (P ≤ 0.05, Table 6). This exploratory study highlights the fact that epigenetic differences can be observed among subjects exposed to varying frequencies of local traffic.
Our results suggest that exposure to emissions, presumably from the exhaust of heavy vehicles passing by the residences of study subjects, may have an impact on DNA-m. It has been suggested that epigenetic states can convey susceptibility to air pollution, which can lead to biological changes, and ultimately, adverse health [43,52]. DNA-m profiles can provide insight into aspects of biology such as gene activity and regulation, and our gene enrichment analysis offers examples of how the genes associated with the CpG sites are related to various molecular functions, pathways and some rare diseases. Based on the location of the CpG site such as promoter or body, altered methylation may lead to increased transcription, silencing or altered splicing [53][54][55][56]. Hence, a differential transcription level is only one of the consequences of DNA-m. For instance, it has been considered that methylation in promoter regions may lead to changes in gene expression, i.e. gene silencing [57]; and such changes can serve as putative markers or risk factors for altered susceptibility and/or disease states. Additionally, DNA-m can help in identifying CpG sites, and possibly genes, that are more susceptible to environmental exposures [58].
In a replication and gene expression analysis study among 140 newborns from the F2 generation, 6 of the 7 CpG sites that correlated with expression, cg24843003 (DAZAP1), cg12417992 (SLC6A9), cg04154465 (WNT2B), cg24361098 (BCL11A), cg16668397 (JPH3) and cg17053854 (SEPT9), are located in the bodies of the associated genes. The seventh, cg03476673, is found in the 5′ UTR region of CRISPLD2. Of the remaining 23 CpG sites with corresponding expression data but no statistically significant correlations, 3 are located in the TSS1500 region, including cg14162906 (TMEM222), which achieved marginal significance. The rest are in the following regions: body of the associated gene (n = 14), TSS200 region (n = 2), 5′ UTR region (n = 2), ~50 kb upstream of TMEM161B (n = 1) and ~200 kb upstream of SYT16 (n = 1). The association of 31 of 34 genes (identified from CpG sites in this study) with air pollution-related chemicals adds plausibility to potential environment-gene interactions, and can contribute to emerging data that provide a more complete view of environmental exposures. We posit that traffic-related air pollution may be a plausible environmental exposure of interest on the IoW.
With increasing evidence that exposure to air pollution is associated with adverse health outcomes, biologically plausible mechanistic pathways of air pollution's effects, such as oxidative stress, inflammation, coagulation, endothelial function and hemodynamic response, have been implicated [59]. Exposure to ambient particulate matter, which is known to be emitted from diesel truck traffic, is associated with decreased lung function and increases in respiratory disease and symptoms such as asthma exacerbation [47][60][61][62][72]. Exposure to gaseous air pollutants, including nitrogen species (e.g. NO2, NO, NOx), is also associated with deleterious effects such as bronchial reactivity, airway oxidative stress, and pulmonary and systemic inflammation [63][64][65][66]. Several epidemiologic studies have reported that short-term increases in ambient pollutants such as PM2.5 and nitrogen dioxide (NO2) are associated with increases in airway inflammation in children and adults [67][68][69][70][71][72][73].
A recent epigenome-wide meta-analysis by Gruzieva et al. [74] provides evidence on the association between prenatal air pollution exposures and differences in the methylation of several genes in cord blood. In particular, the authors found significant associations between NO2 exposures and DNA-m for CpG sites that mapped to genes in the solute carrier family (SLC), family with sequence similarity (FAM) and transmembrane proteins (TMEM). CpG sites mapped to genes (e.g. TMEM161B) related to these three gene superfamilies from the Gruzieva et al. meta-analysis were associated with heavy vehicular traffic frequency and DNA-m in our study. The association of one of the CpG sites with FAM132A (which codes for an important anti-inflammatory adipokine [75]) strengthens the hypothesis that inflammation may be a possible mechanism through which ambient air pollution affects human health [76].
While underlying molecular alterations of air pollution mediated adverse health remain to be further investigated, another recent study with two European cohorts identified decreasing DNA-m on CpG island shores, shelves and gene bodies with increasing concentrations of nitrogen oxide (NO) species [77]. NO species are currently the best available indicators of spatial variation and mixtures of outdoor urban air pollution such as traffic [78]. Our analysis did not reveal CpG sites associated with the inflammatory genes mentioned in the above study, and to the best of our knowledge, the significant CpG sites reported in our study have not been reported in previous air pollution studies. This may be due to differences in (i) study populations, (ii) exposure assessment and concentrations, (iii) complex multiple biological pathways or (iv) a combination of any of the previous three reasons. These newly identified CpG sites and associated genes are certainly worth exploring in larger cohorts. In our study, the two CpG sites that were associated with vehicular traffic in both the F1 and F2 generation may be reflective of the effects of TRAP exposures at these two loci. It also suggests possible prenatal exposures to traffic-related air pollutants in the F2 generation. Secondly, correlation between DNA-m and gene expression at 7 of 31 CpG sites (and three marginal correlations) supports the hypothesis that DNA-m is a potential mechanism through which traffic-related air pollutants can affect gene expression. Three of these seven CpG sites are associated with genes previously identified in the literature to be related to inhalation. For instance, CRISPLD2 has been identified as a glucocorticoid responsive gene that modulates cytokine function in airway smooth muscle cells [79].
WNT2B has been reported to be associated with embryonic origins of the lung, since the inactivation of WNT2A and WNT2B resulted in complete absence of lung development [80]. Methylation of JPH3 from sputum samples is a sensitive and specific predictor of chronic mucous hypersecretion in former male smokers [81]. The lack of 100% replication and correlation in our analysis may be due to small sample sizes and exposure misclassification from the use of questionnaire data rather than air pollution data (for instance, the questionnaire administered at 18 years specified 'heavy vehicle' while the questionnaire during pregnancy only mentioned 'vehicle'). While our results must be interpreted with caution, additional studies add to the evidence that exposure to air pollution is associated with differential DNA-m. A recent study, which did not replicate its results in a separate independent cohort, found that living close to major roadways at birth was associated with differential cord blood methylation [82]. Another study, which was also not replicated in an independent cohort, found significant associations between long-term air pollution exposure (NO2) and DNA-m for seven CpG sites (Bonferroni corrected threshold P < 1.2E-7) [83].
With continuing indication that exposure to ambient air pollutants may contribute to adverse public health [1,84], further research is needed to identify the components of air pollution that determine its toxicity, and a pristine environment such as the IoW could offer a suitable setting to study ambient air pollutant toxicity. The constituents of the pollution potentially generated by heavy vehicles may need to be identified so that early preventative and possible control strategies can be targeted efficiently. Whether these findings raise the risk for future cellular malfunction and disease is unknown. One main reason for the persistence (or the lack thereof) of such findings could be attributed to small sample sizes. In our case, the number of nonsmokers was consistently between 248 and 270, while the number of smokers was between 78 and 95. Another reason could be the small effect sizes that are common in environmental epigenetic research [85]. Profiling of the epigenome over time in this population will help improve understanding of TRAP exposures and how the epigenome responds to these stimuli. Additionally, we found that secondhand smoke exposure is represented by the questions posed to subjects about tobacco smoke exposures, since these variables were associated with the methylation of cg05575921 (AHRR), while the exposure variable was not. Therefore, the observed effects of heavy vehicular traffic on DNA-m may be independent of the contribution of tobacco smoke. Further studies may be needed to examine this in depth. There are some limitations to this study.
First, our exposure variable of interest, heavy vehicular traffic frequency, was ascertained by questionnaire responses from study subjects and we did not attempt to conduct exposure assessment inside or outside their residences; these analyses were also based on current residences (at the time of the blood draw at 18 years old in the F1 generation) as opposed to conditions in former places of residence. Secondly, the associations observed in this study are informative. However, further analysis may be needed to assess other self-reported exposures such as tobacco exposures, particularly on a cumulative scale. Given that the data in this pilot study are from a birth cohort to which a third generation follow-up has been added, further investigation of the DNA-m of the same subset of this population at earlier time points or in their offspring could address some of these limitations. Thirdly, methylation data were obtained from whole blood rather than from specific cell subgroups, due to cost. While differential methylation may or may not be present in all cell subsets, we believe that important biological insights still may be gained from studying DNA-m in whole blood [86]. Moreover, we did adjust for the cell types in the screening step of the analysis, thereby mitigating this limitation. Additionally, multiple studies have validated the 450K DNA-m array from Illumina [87][88][89], and this assay is generally accepted in the scientific literature. Hence we did not see a necessity to additionally test the results of specific CpGs from the 450K DNA-m array with methyl-specific qPCR. The use of bisulfite sequencing can be challenging, since it reduces genome complexity and some of the methods may not differentiate between methylcytosine and hydroxymethylcytosine. The incorporation of appropriate controls for bisulfite reactions and careful interpretation of DNA-m level after accounting for cell types can overcome some of these challenges [90].
An overview of major difficulties related to bisulfite sequencing and how to overcome them is presented in the review by Li et al. [91]. Although the correlations between CpG sites and expression data reached statistical significance, the coefficients were weak. One may consider this a limitation of our study; however, gene expression is influenced by multiple factors and our analysis focuses only on the role of DNA-m in gene expression. Future studies with large sample sizes need to further investigate associations between traffic-related DNA-m and gene expression, taking other factors, such as genetic polymorphisms and networks of related genes, into consideration. Finally, since this is the first study that shows an effect of varying heavy vehicular traffic frequency on DNA-m among residents on the Isle, further replication of these associations in an independent cohort is needed.

Conclusions
Our findings reveal differences in DNA-m in participants who reported higher heavy vehicular traffic frequencies when compared with participants who reported lower frequencies. Such findings may be attributed to TRAP exposure and suggest that further studies are needed.

Study Population
Subjects in this study are from a whole population birth cohort established in 1989 on the IoW, UK, to prospectively study the natural history of allergies and asthma. This cohort has been described in detail elsewhere [92]. Informed consent and detailed information from questionnaires were obtained from participants at recruitment and at each follow-up: 1, 2, 4, 10 and 18 years [93]. The questionnaires for the entire birth cohort study are for study-specific objectives, while asthma and allergy symptom questions are from the validated International Study of Asthma and Allergies in Childhood (ISAAC) [92]. Local Research Ethics Committees approved the parent study, and the Institutional Review Board at the Medical University of South Carolina approved the current study. In this exploratory analysis, we focus on 369 individuals (245 women and 124 men) with DNA-m measurements at age 18 years. Due to the original study question of inheritance via females, we included more females than males at 18 years.

DNA-m Analysis
DNA was extracted from peripheral blood samples and its concentration was determined by Qubit quantitation, as described previously [94]. Genome-wide DNA-m was assessed using the Illumina Infinium Human Methylation 450 beadchip (Illumina, Inc., CA, USA), which interrogates >484 000 CpG sites associated with approximately 24 000 genes. Arrays were processed and imaged using the manufacturer's recommendations, as described elsewhere [95]. Multiple identical control samples were assigned to each bisulfite conversion batch, and the samples were randomly distributed on microarrays to assess assay variability and to control batch effects respectively.
Methylation levels (β values) were calculated for queried CpG loci using the methylation module of GenomeStudio software [96]. DNA-m levels for each CpG were estimated as the proportion of intensity of methylated (M) over the sum of methylated (M) and unmethylated (U) probes, β = M/(c + M + U), with c being a constant to prevent dividing by zero [97]. DNA-m levels were corrected for batch effect using 'IMA' and 'ComBat' packages in R [98]. M-values were calculated as the log2 ratio of the intensities of methylated probe versus unmethylated probe, and used in subsequent analysis [99]. The detection P-value for each CpG site was used as a quality control measure of probe performance, and CpG sites with (i) detection P-value > 0.01 in >10% of the samples and (ii) probe single nucleotide polymorphisms (SNPs) were excluded from all analyses.
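The β-value formula, the M-value transformation and the detection P-value filter described above can be written out as a minimal sketch. The function names are hypothetical, and two numerical choices are assumptions not stated in the text: the default constant c = 100 and the small offset added to both intensities in the M-value so that the logarithm stays defined when an intensity is zero.

```python
from math import log2


def beta_value(meth, unmeth, c=100.0):
    """beta = M / (c + M + U); c (assumed 100 here) prevents division by zero."""
    return meth / (c + meth + unmeth)


def m_value(meth, unmeth, offset=1.0):
    """log2 ratio of methylated to unmethylated intensity.

    The offset is an assumption of this sketch; it keeps the ratio
    finite when either intensity is zero.
    """
    return log2((meth + offset) / (unmeth + offset))


def passes_detection_qc(detection_pvalues, p_cutoff=0.01, max_fail_fraction=0.10):
    """Keep a CpG unless its detection P-value exceeds p_cutoff
    in more than max_fail_fraction of the samples."""
    failed = sum(1 for p in detection_pvalues if p > p_cutoff)
    return failed / len(detection_pvalues) <= max_fail_fraction


if __name__ == "__main__":
    # Hypothetical probe intensities for one CpG in one sample.
    print(beta_value(3000, 1000), m_value(3000, 1000))
```

β values are bounded between 0 and 1 and are easy to interpret as a methylated proportion, whereas M-values are unbounded and are the quantity carried into the subsequent analysis according to the text.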
We estimated the proportion of cell types in adult peripheral blood using the estimateCellCounts() function in the minfi package

Exposure Assessment
The exposure variable of interest, the frequency of heavy vehicular traffic, was determined through questionnaire responses from the subjects to the question: 'How often do heavy vehicles (e.g. trucks/buses) pass your house or on the street less than 100 meters away?' The five-point response included: never, seldom, 10 per day, 1-9 per hour or >10 per hour. We also assessed answers to other air pollution related questions such as 'How often do cars pass your house or on the street less than 100 meters away?' and 'How frequently are you annoyed by outdoor air pollution (from traffic industry, etc) in your home if you keep the window open?'. All subjects were approximately 18 years old when the questionnaire containing these questions was administered. While we are not aware of any study in the literature that has used the same question to assess exposure to TRAP, others have used questionnaire-derived assessments as air pollution exposure variables [102,103]. Others have used such questions along with proximity to roadways, air pollution measurements and land use regressions, together with the validated and widely used International Study of Asthma and Allergies in Childhood (ISAAC) questionnaire, to successfully characterize health effects of interest [104][105][106].

Covariates of Interest
For this exploratory study, the covariates of interest obtained from the subjects' mothers were as follows: (i) gender; (ii) maternal smoking status during pregnancy obtained from questionnaires at birth of the subject; (iii) tobacco smoke exposure obtained through questionnaires completed at birth and at ages 1, 2, 4 and 10 years. Other covariates were obtained from the questionnaire administered to the subjects at age 18: (iv) socioeconomic status (SES) ascertained from the question 'what is your family's annual income (estimate)?'; (v) current smoking status, and age subject started to smoke if applicable; (vi) exposure to smoke outside the home ascertained by the question 'are you regularly exposed to smoking outside the home?'; (vii) BMI calculated from height and weight measurements obtained during the 18-year follow-up, using the following formula: weight (kg)/height (m)². In addition, we considered the type of residential property the subjects lived in (rented privately, rented council/housing association, owned privately or other), whether the subjects were still living with their parents, and the duration of living in the present house (obtained in the course of the 4-year follow-up).

Statistical Analysis
Descriptive statistics and chi-square tests were used to assess whether the 369 subjects in this study were representative of the total birth cohort. We then conducted statistical analyses in two main steps.
Step 1: Epigenome-wide Association Analysis
Screening tool. We employed the ttScreening package (an epigenome-wide DNA-m site screening tool) to examine CpGs that are potentially associated with the frequency of heavy vehicles passing by subjects' homes at age 18 years. This approach to screening epigenome-wide data was used because it generally performs better and has the potential to control both type I and type II errors [107]. Specifically, the ttScreening package conducts surrogate variable analysis, which removes unexplained variation in the data, prior to an iterative training-testing procedure. This training-testing method performs better than methods such as FDR and Bonferroni correction in reducing false-positive and false-negative results. In addition to providing internal validation, the use of training-testing builds more generalized models than those constructed by traditional methods, and can detect additional loci that traditional methods miss [107]. The analytical methods implemented in the package employed a screening process that filtered non-informative CpGs through 100 iterations of a training-and-testing (TT) process with robust regressions. We followed the default settings for the ttScreening method: (i) two-thirds of the data for training, (ii) the 'two-step' method for surrogate variable analysis (sva.method) [108], (iii) 100 iterations for the total number of screenings (iterations), (iv) 50% as the cutoff proportion of those 100 iterations (cv.cutoff) and (v) a 0.05 significance level for the training (train.alpha) and testing data (test.alpha). The 100 iterations are recommended by the authors of the ttScreening package to balance computing efficiency against adequate resampling to arrive at true associations. The default cutoff proportion is 50% because informative CpGs are usually sparse in comparison to the candidate CpG sites, and the authors' simulations identified a 50% cutoff as suitable for small and large sample sizes [107]. The independent and dependent variables were heavy vehicular traffic frequency and DNA-m, respectively. A CpG was selected as an informative site if it showed statistical significance in at least 50% of iterations. The ttScreening() function automatically adjusts for multiple testing using three methods: FDR, Bonferroni and the TT method [109].
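The training-testing idea described above can be sketched as follows. This is a simplified illustration and not the ttScreening implementation: it omits surrogate variable analysis and replaces the robust regressions with a Pearson correlation test based on Fisher's z (a normal approximation). The function names and the toy data are assumptions made for the example.

```python
import math
import random
from statistics import NormalDist


def correlation_pvalue(x, y):
    """Two-sided P-value for a Pearson correlation via Fisher's z transform."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if n < 4 or sxx == 0 or syy == 0:
        return 1.0  # no usable variation: treat as not significant
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def tt_screen(exposure, cpgs, iterations=100, train_fraction=2 / 3,
              train_alpha=0.05, test_alpha=0.05, cutoff=0.5, seed=1):
    """Return the CpG names significant in both halves of >= `cutoff` of the splits."""
    rng = random.Random(seed)
    n = len(exposure)
    n_train = round(n * train_fraction)
    passes = {name: 0 for name in cpgs}
    for _ in range(iterations):
        idx = list(range(n))
        rng.shuffle(idx)  # new random training/testing split each iteration
        train, test = idx[:n_train], idx[n_train:]
        for name, values in cpgs.items():
            p_train = correlation_pvalue([exposure[i] for i in train],
                                         [values[i] for i in train])
            if p_train >= train_alpha:
                continue  # failed in the training data, skip the testing step
            p_test = correlation_pvalue([exposure[i] for i in test],
                                        [values[i] for i in test])
            if p_test < test_alpha:
                passes[name] += 1
    return [name for name, k in passes.items() if k / iterations >= cutoff]


# Toy data: exposure coded 0-4, one CpG tracking exposure, one unrelated CpG.
exposure = [i % 5 for i in range(60)]
signal = [0.5 * e + 0.1 * ((7 * i) % 3) for i, e in enumerate(exposure)]
noise = [(i // 5) % 2 for i in range(60)]
print(tt_screen(exposure, {"cg_signal": signal, "cg_noise": noise}))
```

The default arguments mirror the settings listed above (two-thirds training data, 100 iterations, 50% cutoff, 0.05 significance in both halves).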
CpG by CpG analysis. As an alternative to the ttScreening method, we also conducted multiple linear regressions with the M-values of each CpG while adjusting for all covariates selected a priori, and calculated P-values adjusted for multiple comparisons (p.adjust() function in base R). The exposure variable in this case was classified as 'Any' versus 'No' heavy vehicular traffic frequency. All procedures in Step 1 were conducted with R (version 3.4.2) [110].
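As an illustration of what a multiple-comparison adjustment does to a vector of per-CpG P-values, the sketch below implements two common corrections, Bonferroni and Benjamini-Hochberg FDR. Which adjustment method was passed to p.adjust() is not stated here, so both are shown only as examples, and the function names are illustrative.

```python
def bonferroni(pvalues):
    """Multiply each P-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    # Walk from the largest P-value down, keeping a running minimum so the
    # adjusted values stay monotone in the ranks.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(order):
        rank = m - position  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw P-values for four CpGs.
raw = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(raw))
print(benjamini_hochberg(raw))
```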
Tobacco smoke exposure. Prior epigenome-wide association studies have shown that the methylation of cg05575921, located in the AHRR gene, is a robust indicator of tobacco smoke exposure [111,112]. Even across different demographics, smoking histories and rates of false-negative self-reported smoking behavior, this CpG site can reliably detect smoking status [113]. Additionally, a recent study revealed that high levels of recent secondhand smoke exposure were inversely associated with DNA-m of cg05575921 in monocytes from nonsmokers, although the effects were weaker than in active smokers [114]. Hence we fitted linear regression models with self-reported smoking status and secondhand smoke exposure to examine the relationships between this CpG site and tobacco smoke exposure, as well as our exposure variable, heavy vehicular traffic frequency.
Step 2: Associations between the Frequency of Heavy Vehicular Traffic and DNA-m
To investigate preliminary associations with heavy vehicular traffic frequency, we first used analysis of variance (ANOVA), with heavy vehicular traffic frequency as the only factor, to assess differences in unadjusted DNA-m at the CpGs identified by the ttScreening method in Step 1. The CpGs were then tested in multiple linear models that included potential confounders to assess their association with heavy vehicular traffic frequency. A general form of the model is given in Equation (1):

$$\mathrm{DNA.M}_{iv} = \alpha + \beta_v + \varepsilon_{iv} \qquad (1)$$

where $\mathrm{DNA.M}_{iv}$ refers to the DNA-m for the ith subject reporting the vth category of heavy vehicular traffic frequency, $\alpha$ is the intercept and $\varepsilon_{iv}$ is the error term. The coefficient $\beta_v$ is the deviation in mean DNA-m for the vth category of heavy vehicle traffic frequency (seldom, 10 per day, 1-9 per hour and >10 per hour) compared to never. The LSMEANS statement was used to derive model-adjusted means.

Modeling and Variable Selection
For a covariate to be considered a confounder, the estimate of the regression coefficient for heavy vehicle traffic frequency in the reduced model (which excluded the covariate of interest) had to differ by more than 10% from the estimate in the full model (which included all covariates considered a priori in this study) [115]. The final models for each CpG site included gender and any identified confounders. Models were assessed for all subjects and then stratified by gender and current smoking status, since exposure to tobacco can lead to extensive genome-wide changes in DNA-m [116].
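The 10% change-in-estimate rule can be written as a short check. This is a minimal sketch; the function name and the example coefficients are hypothetical, and the full-model estimate is taken as the reference for the relative change, as described above.

```python
def is_confounder(beta_full, beta_reduced, threshold=0.10):
    """True if dropping the covariate moves the exposure coefficient by more
    than `threshold`, measured relative to the full-model estimate."""
    if beta_full == 0:
        raise ValueError("relative change is undefined when the full-model estimate is 0")
    relative_change = abs(beta_reduced - beta_full) / abs(beta_full)
    return relative_change > threshold


# Hypothetical exposure coefficients from a full and a reduced model.
print(is_confounder(beta_full=0.50, beta_reduced=0.58))  # 16% change
print(is_confounder(beta_full=0.50, beta_reduced=0.52))  # 4% change
```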

Adjusted DNA-m Means and Trend Test
We performed Dunnett's tests to compare model adjusted (marginal) means from four heavy vehicle traffic frequency categories (seldom, 10 per day, 1-9 per hour or >10 per hour) against a control group mean (never) to check for statistically significant differences. We also used PROC IML's ORPOL function in SAS [117] to obtain appropriate coefficients for contrast statements to test for linear trends in increasing heavy vehicular frequency with increasing or decreasing DNA-m measurements, only when marginal means were significantly different from the control mean (never category). When marginal means did not significantly differ from the control category, the results were not provided. P values <0.1 were considered statistically significant for the trend tests. Finally, marginal means for DNA-m were plotted by category of reported heavy vehicle traffic frequency.
Step 2 was performed with the SAS statistical package (version 9.4; SAS Institute, Cary, NC, USA). All plots were derived using 'ggplot' function in R.
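The linear trend contrast mentioned above can be illustrated outside SAS. The sketch below builds the orthonormal linear contrast for equally spaced levels, which is an assumption: the spacing actually supplied to ORPOL is not stated here. The adjusted means in the example are hypothetical.

```python
import math


def linear_contrast(n_levels):
    """Orthonormal linear contrast coefficients for equally spaced levels."""
    center = (n_levels - 1) / 2
    raw = [k - center for k in range(n_levels)]   # e.g. -2, -1, 0, 1, 2
    norm = math.sqrt(sum(c * c for c in raw))
    return [c / norm for c in raw]


def linear_trend_estimate(adjusted_means):
    """Apply the linear contrast to model-adjusted means ordered by category."""
    coeffs = linear_contrast(len(adjusted_means))
    return sum(c * m for c, m in zip(coeffs, adjusted_means))


# Hypothetical adjusted DNA-m means for never ... >10 per hour.
means = [0.40, 0.42, 0.43, 0.46, 0.49]
print(linear_contrast(5))
print(linear_trend_estimate(means))  # positive when means rise with exposure
```

A formal trend test would divide this estimate by its standard error from the fitted model, which the sketch does not attempt.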

Study Population
Thirty-one of 35 significant CpG sites found in the present study for the 369 subjects in the F1 generation were tested in the DNA-m and gene expression data from cord blood in the newborn cohort, the F2 generation (n = 155, born 2006-2013). This step constitutes a replication of the CpGs in a semi-independent cohort. In the F2 generation, there were 76 males and 79 females and the average birthweight was 3459.3 g (standard deviation: 504.6). The median birthweight was 3515 g (n = 148). The exposure variable was obtained from the questionnaire administered to the mothers during pregnancy. The mothers' answers to this question were used as the exposure (independent) variable of interest: 'How often do vehicles pass your house or on the street less than 100 meters away?' The answers were never, seldom, 10 per day, 1-9 per hour or >10 per hour. When a mother answered the question once instead of three times, this answer was assigned as the frequency of vehicles that passed by the home during the entire pregnancy. If she answered two or three times, the lowest frequency was assumed to be her exposure. This conservative choice was made because the pregnancy questionnaire did not specify 'heavy vehicles', unlike the question posed to the mothers (F1 generation) at age 18. It also allowed for a distribution of responses as follows: never (2), seldom (8), 10 per day (26), 1-9 per hour (39) and >10 per hour (72). Eight mothers did not provide an answer to this question during any of the three trimesters and were excluded from the remaining analysis. Also, 31 of the 35 top CpG sites were available for the F2 newborn subset.
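The rule for collapsing up to three trimester answers into one exposure category can be sketched as follows. The function name and the use of None for an unanswered trimester are illustrative assumptions.

```python
# Response categories in increasing order of reported traffic frequency.
FREQUENCY_ORDER = ["never", "seldom", "10 per day", "1-9 per hour", ">10 per hour"]


def pregnancy_exposure(trimester_answers):
    """Return the lowest reported category, or None if no trimester was answered.

    A single answer is used as-is; with two or three answers the lowest
    frequency is taken, which is the conservative choice described above.
    """
    given = [a for a in trimester_answers if a is not None]
    if not given:
        return None  # such mothers were excluded from the analysis
    return min(given, key=FREQUENCY_ORDER.index)


print(pregnancy_exposure([None, "seldom", None]))
print(pregnancy_exposure([">10 per hour", "10 per day", "1-9 per hour"]))
```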

Gene Expression Array
At birth, IoW F2 cord blood samples were collected into PAXgene Bone Marrow RNA Tubes and RNA extracted using PAXgene RNA kits (PreAnalytiX GmbH, Switzerland). RNA integrity was verified with the Agilent 2100 Bioanalyzer system. Genome-wide mRNA expression was assessed via one color (Cy3) experiments with the Agilent (Agilent Technologies, Santa Clara, CA) SurePrint G3 Human Gene Expression 8×60k v2 microarray kits. Array content was sourced from RefSeq, Ensembl, UniGene and GenBank databases and provides full coverage of the human transcriptome in 50 599 biological features (including replicate probes and control probes). The oligos were 60mer in length and each transcript was tagged at least once and some had multiple tagging oligos for genes with documented splice variants. Data QC indices and analyses were performed with Agilent GeneSpring software. These data were then percent shift normalized and log2-transformed.

Statistical Analysis
DNA-m data: Linear regression models, Dunnett's multiple comparison tests and trend tests were used to assess the relationship between the frequency of vehicular traffic and DNA-m, as previously described for the subset from the F1 generation. The models were adjusted for gender and birthweight. Successful replication was defined as having the same direction of differential methylation and a P-value of <0.05.
Gene expression data: We calculated partial Spearman's rank correlations between the DNA-m at 31 of 35 CpG sites and gene expression data for the associated genes while controlling for cell types (Bcell, CD4T, CD8T, gran, mono, NK and nRBC). Since cord blood includes nucleated red blood cells, we used the cell references provided by Bakulski and colleagues [118,119].
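A partial Spearman correlation can be sketched as a partial Pearson correlation computed on ranks. The version below uses the standard recursive formula for partial correlations and is only an illustration under that assumption; it does not compute P-values and is not the software used in this study. With seven cell-type covariates the recursion remains small (3^7 pairwise correlations).

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, covariates):
    """Partial correlation of x and y given covariates (recursive formula)."""
    if not covariates:
        return pearson(x, y)
    z, rest = covariates[0], covariates[1:]
    r_xy = partial_corr(x, y, rest)
    r_xz = partial_corr(x, z, rest)
    r_yz = partial_corr(y, z, rest)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def partial_spearman(x, y, covariates):
    """Partial Spearman correlation: partial correlation of the ranks."""
    return partial_corr(average_ranks(x), average_ranks(y),
                        [average_ranks(z) for z in covariates])


# Hypothetical DNA-m, expression and one cell-type proportion for five samples.
dna_m = [1, 2, 3, 4, 5]
expression = [2, 1, 4, 3, 5]
cell_type = [1, 3, 2, 5, 4]
print(partial_spearman(dna_m, expression, [cell_type]))
```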