Ming Kei Chung, John S House, Farida S Akhtari, Konstantinos C Makris, Michael A Langston, Khandaker Talat Islam, Philip Holmes, Marc Chadeau-Hyam, Alex I Smirnov, Xiuxia Du, Anne E Thessen, Yuxia Cui, Kai Zhang, Arjun K Manrai, Alison Motsinger-Reif, Chirag J Patel, Members of the Exposomics Consortium, Decoding the exposome: data science methodologies and implications in exposome-wide association studies (ExWASs), Exposome, Volume 4, Issue 1, 2024, osae001, https://doi.org/10.1093/exposome/osae001
Abstract
This paper explores the exposome concept and its role in elucidating the interplay between environmental exposures and human health. We introduce two key concepts critical for exposomics research. Firstly, we discuss the joint impact of genetics and environment on phenotypes, emphasizing the variance attributable to shared and nonshared environmental factors, underscoring the complexity of quantifying the exposome’s influence on health outcomes. Secondly, we introduce the importance of advanced data-driven methods in large cohort studies for exposomic measurements. Here, we introduce the exposome-wide association study (ExWAS), an approach designed for systematic discovery of relationships between phenotypes and various exposures, identifying significant associations while controlling for multiple comparisons. We advocate for the standardized use of the term “exposome-wide association study, ExWAS,” to facilitate clear communication and literature retrieval in this field. The paper aims to guide future health researchers in understanding and evaluating exposomic studies. Our discussion extends to emerging topics, such as FAIR Data Principles, biobanked healthcare datasets, and the functional exposome, outlining the future directions in exposomic research. This abstract provides a succinct overview of our comprehensive approach to understanding the complex dynamics of the exposome and its significant implications for human health.
Introduction
The exposome encompasses an individual’s life-course environmental exposures.1,2 The concept originally focused on studying the environment with objective, higher-precision methodology, such as exposure biomarkers. As the concept spans multiple disciplines in medicine, the sciences, and public health, it was later elaborated by others from different perspectives.3-6 Nevertheless, the ultimate goal remains the same: to quantitatively characterize the phenomenon of multiple exposures in humans and, ultimately, how the totality of human exposure influences phenotypic traits. To pursue this goal, investigators need to understand two fundamental concepts that can guide research and development in exposomics.
Concept 1: The contribution of genetics and the environment to phenotype
To ascertain the total contribution of the environment, and to attribute that contribution to specific environmental factors, it is essential to account for time-varying, repeated, and mixture exposures in the analysis, explaining differences in phenotype not currently explained by candidate environmental and genetic factors, and ultimately solving Equations (1) and (2). From twin-based and genome-wide investigations, the total contribution of genetics is anywhere from 30%–50%,9,10 and that of the shared exposome is about 10%,9 leaving a large amount of phenotypic variance to be described by the nonshared exposome.
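Equations (1) and (2), referenced here and in Concept 2, are not reproduced in this text; a sketch of their likely general form, assuming the standard decomposition of phenotype and phenotypic variance into genetic, shared-environment, and nonshared-environment components discussed above (the exact formulation in the original derivation may differ), is:

Equation (1): $P = G + E_{\text{shared}} + E_{\text{nonshared}}$

Equation (2): $\operatorname{Var}(P) = \operatorname{Var}(G) + \operatorname{Var}(E_{\text{shared}}) + \operatorname{Var}(E_{\text{nonshared}})$

where $P$ denotes the phenotype, $G$ the genetic contribution, and $E_{\text{shared}}$ and $E_{\text{nonshared}}$ the shared and nonshared exposome contributions, respectively.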
Concept 2: Enhancing exposomic measurement at an epidemiological scale through cohort studies
No single universal method or approach is known that can capture a representative landscape of the totality of exposures, and exposomics studies typically require a combination of multiple methods for this purpose.11-13 Large and complex observational studies such as the National Health and Nutrition Examination Survey (NHANES) and the UK Biobank—with comprehensive measurement of both genes and exposures—are becoming increasingly prevalent in health research. New analytical skills are essential to perform data-driven research into the contributions of genetics, exposures, and complex gene-environment interactions to phenotypic outcomes and to address Equations (1) and (2). Apart from basic statistical inference of the associations between exposures and diseases, disentangling and identifying important exposures and building predictive models are also becoming routine analytical procedures.
This essay aims to describe the above concepts from a data science perspective, providing a guide for the next generation of health researchers in examining and appraising exposomic studies (Figure 1). Finally, we share our view on potential topics that could have a major influence on the development and practice of exposomics in the coming decades. We discuss exposome-driven analysis as an extension of an observational epidemiological study, in which the exposome and phenome are measured in human samples, in contrast to experimental studies, in which the investigator can assign individual subjects to predefined exposure groups.

Data science and exposome research
To establish the exposomic concept as a research paradigm, commonly investigated environmental factors in human observational investigations can be classified into three domains—general external, specific external, and internal1—or, alternatively, four categories comprising ecosystems, physical/chemical, lifestyle, and social factors.4 These schemas encourage the adoption of a cross-disciplinary perspective on mixtures of exposures when answering broad research questions on the role of the exposome in health outcomes. Specifically, the shared and nonshared environment contributions introduced in Concept 1 consist of both general external and specific external factors. The internal exposome, on the other hand, can be thought of as the phenotypic changes induced by exposure to the external exposome,14 thereby enabling more in-depth analyses of the relationships between exposures and outcomes, such as “mediation analysis.”15
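To make the mediation idea concrete, the following is a minimal product-of-coefficients sketch in R on simulated data; the variable names (x for an external exposure, m for an internal marker, y for an outcome) and effect sizes are illustrative assumptions, not drawn from any cited study.

```r
# Minimal mediation sketch (product-of-coefficients method) on simulated data.
# Hypothetical variables: x = external exposure, m = internal exposome marker
# (eg, a metabolite), y = health outcome.
set.seed(1)
n <- 500
x <- rnorm(n)                       # external exposure
m <- 0.5 * x + rnorm(n)             # internal marker, partly driven by x
y <- 0.4 * m + 0.2 * x + rnorm(n)   # outcome, affected directly and via m

fit_m <- lm(m ~ x)                  # path a: exposure -> mediator
fit_y <- lm(y ~ x + m)              # path b (mediator -> outcome) and direct path

a <- coef(fit_m)["x"]
b <- coef(fit_y)["m"]
indirect <- a * b                   # mediated (indirect) effect
direct   <- coef(fit_y)["x"]        # direct effect of exposure on outcome
c(indirect = unname(indirect), direct = unname(direct))
```

In practice, a dedicated mediation package with bootstrap confidence intervals and careful confounder adjustment would be preferable to this bare-bones sketch.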
To effectively communicate exposome research, it is essential to convey key data science concepts that complement introductory public health courses and extend to research areas more specific to traditional disciplines such as air pollution, chemical mixtures, and climate change. We begin with the exposome-wide association study (ExWAS), as it is instrumental in estimating the quantities and identifying the factors introduced in Concepts 1 and 2.
Overview of exposome-wide association studies (ExWASs)
ExWAS is a data-driven analytical approach for conducting large-scale exploratory studies in exposomics, inspired by the GWAS paradigm in human genetics. It is robust in that it works across different study designs, including but not limited to cross-sectional, cohort, longitudinal, and (nested) case-control investigations. It is also highly interpretable, as it can be driven by a variety of methods based on regression and other techniques. Fundamentally, ExWAS attempts to systematically model all pairwise relationships between a single phenotype and multiple exposures, with the goal of identifying statistically significant associations while controlling for the effects of multiple comparisons. For example, ExWAS was used to study the association of 266 environmental factors with type 2 diabetes, and both risk factors (eg, heptachlor epoxide) and protective factors (eg, β-carotenes) were identified.16
In a typical ExWAS, the aim is to identify analytically important exposure-outcome pairs across all measured exposures. The choice of regression method, whether basic linear regression or one of its extensions, depends largely on the study design.17-19 Statistical significance, or signal to noise, is estimated through the p value associated with the beta coefficient of the corresponding predictor. In a traditional hypothesis-driven study, the threshold for the type I error rate is set to 5%: simply put, if a test of a true null hypothesis were repeated 100 times, about five of the results would appear statistically significant by chance, ie, false positives. This inflation of significant findings from conducting many statistical tests is a version of data dredging.20 In ExWAS, spurious associations due to multiple comparisons are controlled, for example, using the false discovery rate (FDR),21 the expected proportion of false positives among all positive findings in a study. Similarly, a Manhattan plot showing −log10 p values enables quick visual inspection of all the associations (Figure 2).

Figure 2. A Manhattan plot illustrating the findings of an ExWAS for type 2 diabetes. The X-axis represents the exposures, while the Y-axis shows the corresponding −log10-transformed p values. Each point signifies the association test for a single exposure. The red horizontal line indicates the threshold for statistical significance. Reproduced from Patel et al.,16 used under a Creative Commons Attribution License.
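The following is a minimal sketch in R of the ExWAS procedure described above: one adjusted regression per exposure, FDR correction of the resulting p values, and a Manhattan-style plot. The data, exposure names, covariates, and effect sizes are simulated for illustration only and are not from the cited study.

```r
# Minimal ExWAS sketch: one regression per exposure, FDR-adjusted p values.
set.seed(42)
n <- 2000; p <- 100
expos <- matrix(rnorm(n * p), n, p,
                dimnames = list(NULL, paste0("exposure_", 1:p)))
age <- rnorm(n, 50, 10)
sex <- rbinom(n, 1, 0.5)
# Simulated binary outcome influenced by a handful of exposures
logit <- -1 + 0.3 * expos[, 1] - 0.25 * expos[, 2] + 0.02 * (age - 50)
y <- rbinom(n, 1, plogis(logit))

res <- data.frame(exposure = colnames(expos), beta = NA_real_, p = NA_real_)
for (j in seq_len(p)) {
  fit <- glm(y ~ expos[, j] + age + sex, family = binomial())
  res$beta[j] <- coef(fit)[2]                        # exposure coefficient
  res$p[j]    <- summary(fit)$coefficients[2, 4]     # its p value
}
res$fdr <- p.adjust(res$p, method = "BH")            # false discovery rate control

# Manhattan-style plot: -log10 p per exposure, with an FDR-based cutoff
plot(-log10(res$p), pch = 19, xlab = "Exposure index",
     ylab = expression(-log[10](p)))
sig <- res$p[res$fdr < 0.05]
if (length(sig) > 0) abline(h = -log10(max(sig)), col = "red", lty = 2)
```

In a real analysis, the regression family, covariate set, and any survey weights would follow the study design rather than the defaults assumed here.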
Since ExWAS is a discovery-based approach, confirmation of statistically significant results is essential. In its simplest form, the samples in a study are split into two parts—one for discovery and one for validation—whereby the investigator seeks concordance of association statistics (eg, association size or beta coefficient) in more than one sample of the study, such as a held-out sample, an independent survey cycle, or an entirely new cohort.13,16 Replicability can be assessed through an independent set of data.22 To assist in conducting and interpreting ExWAS results, we have tabulated key resources, from identifying datasets to locating R statistical packages for analysis, in Table 1.
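Building on the sketch above, one simple way to implement the discovery/validation split is to rerun the per-exposure regressions in each half of the sample and check that FDR-significant exposures from the discovery half show concordant effect directions, and nominal significance, in the held-out half. The code below is again a sketch on the simulated objects defined earlier (n, p, expos, y, age, sex), not the workflow of any specific cited study.

```r
# Discovery/validation split for the simulated ExWAS above (hypothetical workflow).
idx <- sample(seq_len(n), size = n / 2)              # discovery half
run_exwas <- function(rows) {
  t(sapply(seq_len(p), function(j) {
    fit <- glm(y[rows] ~ expos[rows, j] + age[rows] + sex[rows],
               family = binomial())
    c(beta = unname(coef(fit)[2]),
      p = summary(fit)$coefficients[2, 4])
  }))
}
disc  <- as.data.frame(run_exwas(idx))
valid <- as.data.frame(run_exwas(setdiff(seq_len(n), idx)))

hits <- which(p.adjust(disc$p, "BH") < 0.05)         # FDR-significant in discovery
# Concordance check: same sign of beta and nominal p < 0.05 in validation
replicated <- hits[sign(disc$beta[hits]) == sign(valid$beta[hits]) &
                   valid$p[hits] < 0.05]
colnames(expos)[replicated]
```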
Table 1. Epidemiological data science resources for ExWAS analyses and interpretation

| Type | Name | Description |
|---|---|---|
| Epidemiological data source | | |
| Data portals | | |
| | The Trans-Omics for Precision Medicine (TOPMed): https://topmed.nhlbi.nih.gov/ | The TOPMed program comprises ∼180k participants from more than 85 different studies. It includes ancestrally and ethnically diverse sets of participants, focusing on phenotypes of heart, lung, blood, and sleep disorders, with multi-omic data such as whole-genome sequencing (WGS) and other omics (eg, transcriptomics, epigenomics, metabolomics, and proteomics) integrated with molecular, behavioral, imaging, environmental, and clinical data. |
| | Database of Genotypes and Phenotypes (dbGaP): https://www.ncbi.nlm.nih.gov/gap/ | dbGaP is an NIH-maintained database that archives the data and results from genotype-phenotype studies in humans. It contains both open-access and controlled-access data. Data are organized as studies with phenotype data and genotype data such as SNP assays, methylation data, CNVs, genomic sequencing, exome data, expression arrays, and RNA-Seq data. |
| | Environmental influences on Child Health Outcomes (ECHO): https://echochildren.org | The “ECHO Cohort” integrates numerous child cohorts, totaling over 16k children, and covers five outcome areas: obesity; pre-, peri-, and postnatal outcomes; upper/lower respiratory disease; wellness; and neurodevelopment. |
| | Human Health Exposure Analysis Resource (HHEAR): https://hhearprogram.org/ | HHEAR is a centralized network of exposure analysis services and expertise available to eligible researchers who want to add or broaden exposure analysis in their human health studies. The HHEAR Data Center maintains a repository of HHEAR data, including epidemiologic, biomarker, and environmental exposure data, and associated data science tools. |
| Cohorts | | |
| | The All of Us Research Program: https://allofus.nih.gov/ | The All of Us Research Program aims to collect health data from >1 million participants in the USA, with a focus on involving previously underrepresented populations. The cohort consists of ∼429 k participants with electronic health records, survey data, genomic data, labs and physical measurements, and biospecimens. The surveys include questions on overall health, lifestyle, medical history, and social determinants of health. |
| | The UK Biobank: https://www.ukbiobank.ac.uk/ | The UK Biobank is a large-scale biomedical database consisting of ∼500 k participants from the UK with genetic, health, and survey data. It is a longitudinal study with health and lifestyle survey data, physical measurements, biospecimens, imaging, electronic health records, biomarkers, wearables, and multi-omic data (genotyping, whole-genome sequencing, and whole-exome sequencing). |
| | The Million Veteran Program (MVP): https://www.research.va.gov/mvp/ | The MVP investigates the roles of genetics, lifestyle, exposures, and military experiences in the health and wellness of Veterans in the USA. The MVP cohort consists of ∼930 k participants with electronic health records, self-reported surveys, and genotype data. The surveys comprise information on health, lifestyle, military experiences and exposures, medical history, and diet. |
| | The Nurses' Health Study (NHS): https://nurseshealthstudy.org/ | The NHS consists of three prospective cohorts with ∼275 k nurses, primarily female, with questionnaire data and biospecimens. Questionnaires are administered biennially and include questions on health, medical history, lifestyle, diet, behavior, environment, and nursing occupational exposures. Biospecimens such as blood, urine, buccal DNA, and toenail samples are available for a subset of participants. |
| | The Health Professionals Follow-up Study (HPFS): https://www.hsph.harvard.edu/hpfs/ | The HPFS is an all-male study designed to complement the primarily female Nurses' Health Study. It comprises ∼22 k males in health professions such as dentists, pharmacists, optometrists, podiatrists, osteopaths, and veterinarians. Questionnaires are administered biennially and include questions about diseases such as cancer, heart disease, and other vascular diseases, as well as health-related topics such as smoking, physical activity, lifestyle, diet, and medications. |
| | The National Health and Nutrition Examination Survey (NHANES): https://www.cdc.gov/nchs/nhanes/index.htm | NHANES is a program conducted by the CDC in the USA, designed to assess the health and nutritional status of the US population through interviews and physical examinations. It uses a complex, multistage probability design to ensure its sample is representative of the US civilian noninstitutionalized population. NHANES plays a crucial role in exposomic sciences by providing extensive data on environmental exposures, such as various toxins, and contributing to biomonitoring efforts. The data gathered are pivotal for epidemiological studies exploring the relationship between environmental factors and health outcomes. |
| | The Personalized Environment and Genes Study (PEGS): https://www.niehs.nih.gov/research/clinical/studies/pegs/ | PEGS integrates genetic and environmental data for ∼10 k racially and ethnically diverse participants and includes multi-dimensional data consisting of phenotypic data, genomic data, and extensive questionnaire-based and geospatial estimates of exposome-wide environmental exposures. The surveys include questions on health, lifestyle, medical history, and various exposures such as residential and occupational environmental exposures, medication use, physical activity, stress, sleep, diet, and reproductive history. |
| | Human Early Life Exposome (HELIX) project: https://helixomics.isglobal.org/ | The HELIX project is a resource of multi-omics and exposome data for 1301 mother-child pairs from six European cohorts. The ExWAS (exposome-wide association analyses) catalog can be used to query and download findings from the HELIX ExWAS. Summarized results for other omic analyses are also available for download. |
| Location-based exposure sources | | |
| | The Center for Air, Climate, and Energy Solutions (CACES): https://www.caces.us | The CACES land-use regression (LUR) models provide estimates of outdoor concentrations for multiple pollutants by census tract. The CACES reduced complexity models (RCMs) estimate the impact of the emissions of multiple pollutants on human health. |
| | NASA Earthdata Collection: https://www.earthdata.nasa.gov/ | The Earthdata collection includes measurements of the Earth’s atmosphere, land, ocean, and cryosphere from a variety of sources, including sensor data from satellite and aircraft platforms, in situ measurements, field campaigns, and model estimates. These measurements can aid in the understanding of climate change, extreme weather patterns, hazards and disasters, air quality, and water resource levels. |
| | CDC/ATSDR Social Vulnerability Index (SVI): https://www.atsdr.cdc.gov/placeandhealth/svi/ | The SVI uses US Census data to calculate social vulnerability at the census tract level (subdivisions of counties for which the Census collects statistical data). Each census tract receives an SVI rank based on 16 social factors, which are grouped into four related themes: socioeconomic status, household characteristics, racial and ethnic minority status, and housing type and transportation. |
| | ATSDR Environmental Justice Index (EJI): https://www.atsdr.cdc.gov/placeandhealth/eji/ | The EJI ranks the overall effects of environmental injustice on health for each census tract. It ranks each census tract on 36 environmental, social, and health factors and groups them into ten domains and three overarching modules: environmental burden, social vulnerability, and health vulnerability. |
| Statistical analysis | | |
| | rexposome: https://www.bioconductor.org/packages/release/bioc/html/rexposome.html | An R package for the analysis of exposome data. It offers a set of functions to incorporate exposome data into the R framework and a series of tools to analyze exposome data. |
| | omicRexposome: https://bioconductor.org/packages/release/bioc/html/omicRexposome.html | omicRexposome uses MultiDataSet for coordinated data management, rexposome for defining exposome data, and limma for association testing to facilitate the study of associations between exposures and omic data. |
| | MR-Base: https://www.mrbase.org/ | A platform for Mendelian randomization using published GWAS summary statistics. |
| Chemical Information & Interpretation | | |
| | Exposome-Explorer: http://exposome-explorer.iarc.fr/ | Exposome-Explorer is a database of biomarkers of exposure to environmental risk factors for diseases. It contains information on known biomarkers of exposure to dietary factors, pollutants, and contaminants measured in population studies. |
| | The Blood Exposome Database: https://bloodexposome.org/ | Chemical lists collated from metabolomics, systems biology, environmental epidemiology, occupation, toxicology, and nutrition, curated via automated text mining of the PubMed and PubChem databases. |
| | The Toxic Exposome Database (T3DB): http://www.t3db.ca/ | The database currently houses 3678 toxins described by 41 602 synonyms, including pollutants, pesticides, drugs, and food toxins, which are linked to 2073 corresponding toxin target records, for a total of 42 374 toxin-toxin target associations. Each toxin record (ToxCard) contains over 90 data fields and holds information such as chemical properties and descriptors, toxicity values, molecular and cellular interactions, and medical information. |
| | CompTox Chemicals Dashboard: https://comptox.epa.gov/dashboard/ | The Dashboard contains chemistry, toxicity, and exposure information for over one million chemicals, with over 420 chemical lists based on structure or category. It also provides access to the information in ExpoCast and ToxCast. Notably, one can also access EPA's Distributed Structure-Searchable Toxicity (DSSTox) Database, which contains accurate mapping of bioassay and physicochemical data on chemical substances to their chemical structures. |
| | PubChem: https://pubchem.ncbi.nlm.nih.gov/ | The open chemistry database of the National Institutes of Health (NIH) since 2004. Small and large molecules with data on structure, identifiers, physicochemical properties, and biological activity, as well as health, safety, and toxicity data. It currently contains over 115M compounds, contributed by academics, government agencies, chemical vendors, and journal publishers. |
| | Chemical Entities of Biological Interest (ChEBI): https://www.ebi.ac.uk/chebi/ | ChEBI is a dictionary of molecular entities. It focuses primarily on small chemical compounds that intervene in the biological processes of living organisms. It currently has over 60 000 annotated compounds. |
| | Tox21: https://ntp.niehs.nih.gov/whatwestudy/tox21/ | Testing of commercial chemicals, pesticides, food additives, and other chemical compounds in hundreds of cell-based and transcriptomic assays with dose-response characterization. |
It can be useful to teach ExWAS through GWAS, which, like ExWAS, is a hypothesis-free, high-throughput inference approach for identifying genetic factors associated with outcomes23 (eg, genotype-phenotype [G-P] correlations). Both use regression methods to identify factors associated with an outcome, and both use similar summary statistics (eg, odds ratios, correlations) to convey the G-P or exposure-phenotype (E-P) relationship. Finally, the exposome or genome as a whole can be related to phenotype via predictive approaches, where the summary statistics include total variance explained or the area under a receiver operating characteristic curve (AUC). Connections between these aggregate summary statistics, AUC, and attributable fraction in genetics research are possible. In genetics research, the aggregate risk, or “architecture” (the total spectrum of disease risk along the genome), is described in terms of the frequency of a genetic variant and its effect size (eg, odds ratio), so investigators can visualize the risk of disease relative to how frequent a risk factor is.
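As an illustration of such aggregate summaries, the sketch below fits a multi-exposure model to simulated data and reports the total variance explained (R²) for a continuous trait and a rank-based AUC for a binary trait; all variables, effect sizes, and the in-sample evaluation are illustrative assumptions rather than a prescribed analysis.

```r
# Aggregate predictive summaries for a set of exposures (simulated data).
set.seed(7)
n <- 1000; k <- 20
E <- matrix(rnorm(n * k), n, k)                         # exposure matrix
y_cont <- as.vector(E %*% rnorm(k, 0, 0.2) + rnorm(n))  # continuous phenotype
y_bin  <- rbinom(n, 1, plogis(as.vector(E %*% rnorm(k, 0, 0.2))))  # binary phenotype

# Total variance explained by the exposures jointly (continuous trait)
fit_cont <- lm(y_cont ~ E)
summary(fit_cont)$r.squared

# AUC for a binary trait: rank-based (Mann-Whitney) estimate from fitted scores
fit_bin <- glm(y_bin ~ E, family = binomial())
score <- fitted(fit_bin)
n1 <- sum(y_bin == 1); n0 <- sum(y_bin == 0)
auc <- (sum(rank(score)[y_bin == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
auc   # note: in-sample and optimistic; held-out data or cross-validation is preferable
```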
How would one articulate the “architecture” of exposome-phenotype associations? As of this writing, the architecture of the exposome is articulated in terms of effect size versus signal to noise, or the p value of single associations. Furthermore, genetic variants and exposures differ in their dynamism and modality24: environmental exposures can change over time and space, depending on many factors including an individual's behavior and lifestyle. The exposome “architecture” is therefore ever-changing through the life course. Genetic variants, in contrast, are static: they are inherited at conception and remain relatively constant throughout an individual’s life, although their architecture may change in the presence of environmental factors. In GWAS, genetic variants can be measured with high accuracy using well-established genotyping technologies,25 whereas measurement of the chemical exposome, such as organic chemicals (eg, DDT, PCBs, and PBDEs) and metals (eg, Pb, Cd, and Cr), generally involves targeted mass spectrometry methods. More recently, both targeted and untargeted mass spectrometry have been evaluated for measuring the exposome.
Study designs for exposomics research
There are a number of study designs for elucidating the role of an exposure in a phenotype (Table 2). In region-wide or nationwide biobanks, subsamples can be extracted based on these basic designs to answer the research question. Since 2010, more than 60 studies applying the ExWAS approach have been published. These studies were designed to identify exposures associated with chronic diseases such as childhood obesity,26 dementia,17 coronary heart disease,27 and autism.28,29 ExWAS has also been used to study exposures related to other outcomes such as mental well-being,18 depression,30 coffee consumption,31 COVID-19,32,33 and child behavior34 (Table 3). Across these studies, single cohorts, integrated cohorts, surveys, and biobank samples were used, and sample sizes ranged from ∼1000 to ∼500 000. Typically, the number of external exposures under investigation was between 50 and 200 (in one case >900), and exposures were often classified into different categories to aid interpretation. Some studies assessed all the available environmental factors in the datasets, while others selected only a subset of the exposures, for instance, dietary exposures and/or other modifying factors.
Common epidemiological study designs and their advantages and disadvantages for exposomic studies

Study design | Description | Advantages | Disadvantages
---|---|---|---
Cross sectional | For this design, data on the exposome and health outcomes are collected at a single point in time. It can provide a snapshot of the relationship between exposomic factors and health outcomes in a specific population. | Suitable for routine data collection and able to estimate population features such as the prevalence of a disease. | Reverse causality: because temporality cannot be established, it is unclear whether the exposome factor preceded the outcome. Confounding: the observed association between an environmental exposure (the exposome factor) and a health outcome is distorted by the presence of another variable. In exposomic research, confounding can be particularly challenging due to the complex and multifaceted nature of environmental exposures.
Case control | It involves comparing the exposure history of individuals with a specific disease or health outcome (cases) to those without the outcome (controls). Cases are enrolled first, and controls with demographic and other key characteristics similar to the cases are then recruited from the same population. | Relatively simple and inexpensive to collect samples and conduct analyses to identify exposures associated with the disease. It is an efficient design for studying rare diseases. | Confounding by unknown factors (see above).
Cohort | In cohort studies, a group of individuals (cohort) is followed over time to assess the relationship between exposures and health outcomes. These studies can be prospective (following individuals forward in time) or retrospective (using existing data to follow individuals backward in time). | Particularly useful for studying the effects of long-term and multiple exposures, as well as investigating the role of critical periods and windows of susceptibility in life-course epidemiology. | Can be time-consuming, expensive, and may be affected by attrition bias. Confounding also remains an issue.
Nested case-control | This design is a hybrid of cohort and case-control designs, where cases and controls are identified from within an existing cohort study. | Has the advantages of the case-control design and a lower cost of exposure measurement due to a reduced sample size. | Matching controls to cases may be inefficient. When multiple outcomes are investigated, a new set of controls is required for each disease.
Examples of published ExWAS studies and their key characteristics

Study | Data source | Study design | Sample size | Modality of exposures | Number of exposures | Phenotype | Model | Summary statistic | Multiplicity control method
---|---|---|---|---|---|---|---|---|---
Zhang et al. | UK Biobank | Cohort | Over 500 000 | 7 categories of factors, eg, lifestyle, medical history, and socioeconomic status | 210 modifiable factors | Dementia | Cox proportional hazard regression | Hazard ratio | Bonferroni
Shah et al. | Coronary Artery Risk Development in Young Adults (CARDIA) | Cohort | 5115 | 17 food and beverage groups, 2 nutrient groups, and metabolome | 19 dietary factors and metabolome | Cardiometabolic-cardiovascular disease | Linear regression | Beta coefficient | FDR
van de Weijer et al. | Geoscience and Health Cohort Consortium and the Netherlands Twin Register | Longitudinal national register | 21 926 | Multiple domains including social, physical, and demographic variables | 133 variables | Well-being | Generalized estimating equations | Beta coefficient | Bonferroni
Saberi Hosnijeh et al. | European Prospective Investigation into Cancer and Nutrition cohort (EPIC) | Cohort | 475 426 | Various anthropometry measures and lifestyle factors | 84 variables | B-cell lymphoma | Cox proportional hazard regression | Hazard ratio | FDR
Julvez et al. | A multi-centric birth cohort study in 6 European countries | Cohort | 1298 mother-child pairs | 19 categories of factors such as metals, built environment, and traffic | 209 variables | Child cognitive function | Linear regression | Beta coefficient | Bonferroni
Jedynak et al. | 5 different European cohorts | Cohort | 708 mother-child pairs | Environmental chemicals | 47 variables | Child behavior | Negative binomial regression | Incidence rate ratio | Bonferroni
Uche et al. | NHANES 1999–2016 | Multiple cross-sectional surveys | 50 048 | All available and useable environmental factors such as metals, toxins, allergens, and nutrients | More than 200 variables | Obesity | Logistic regression | Odds ratio | FDR
Milanlouei et al. | Nurses’ Health Study | Cohort | 62 811 | Dietary factors | 257 nutrients and 117 foods | Coronary heart disease | Cox proportional hazard regression | Hazard ratio | FDR
Granum et al. | A multi-centric birth cohort study in 6 European countries | Cohort | 1270 mother-child pairs | 18 exposure groups during pregnancy and at the subcohort follow-up, such as built environment, air pollution, traffic, and road traffic noise | 197 variables | Allergy-related outcomes in childhood | Logistic regression | Odds ratio | Bonferroni
Elhadad et al. | NHANES III (1988–1994) | Multiple cross-sectional surveys | 17 752 | Metabolites, nutrients, and lifestyle factors | 245 variables | Coffee consumption | Linear regression | Beta coefficient | FDR
Choi et al. | UK Biobank | Cohort | Over 100 000 | Modifiable factors such as behavioral, social, and environmental | 106 variables | Depression | Logistic regression | Odds ratio | Bonferroni
Characteristics of ExWAS
Multi-modality of exposure data
One of the key characteristics of exposomics data is the “multi-modality” of measurement, meaning that it encompasses multiple types and contexts of information.35,36 Examples of these measurements include light and temperature (sub-molecular), biomarkers of chemicals (molecular), dietary intake and physical activity (lifestyle), and income and education (socioeconomic status).37,38 Large nationwide studies of external exposures usually require integration through ZIP Code Tabulation Areas (areal representations of postal ZIP Codes), while analysis of the internal exposome in the context of precision medicine involves multi-omics data such as genomics, transcriptomics, proteomics, and metabolomics.39
How large is the exposome? The “dimensionality,” or how many exposome variables are included in an ExWAS, has analytic implications, such as signal-to-noise and false positive rates. Currently, the largest chemical database contains over 275 million substances,40 but these still make up only a tiny fraction of the theoretical chemical space (millions of billions).4 Exposure information is commonly obtained through geospatial modeling, laboratory measurement, questionnaires, or administrative records. Nevertheless, from a data-analytical perspective, these diverse factors are viewed as one of the following types: categorical variables with nominal and ordinal subtypes, or numeric variables with interval and ratio subtypes. Raw data captured in numeric format can be encoded as is, or recoded into categorical variables. For instance, one may record the number of cigarettes smoked per week and recode it into a variable with three categories: heavy, medium, and light smokers. While the best encoding choice depends on the context of the study, it is generally recommended to record data in the native format to avoid loss of information and bias.41-43 The type of outcome variable affects the choice of regression model: a logistic regression model and a linear regression model are used to analyze binary and continuous outcomes, respectively. Similarly, the types of predictor variables affect the interpretation of the beta coefficients: whether a coefficient represents the change in outcome per unit change of a continuous predictor, or the change in outcome relative to the reference group of a categorical variable.
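As an illustration, the minimal sketch below (in Python, with hypothetical variable and column names) recodes a numeric exposure into categories while keeping the native numeric version, and notes how the outcome type determines the regression family.

```python
import pandas as pd
import numpy as np

# Hypothetical exposome table: one row per participant.
df = pd.DataFrame({
    "cigarettes_per_week": [0, 3, 20, 80, 150, 10],
    "bmi": [22.4, 27.1, 31.0, 29.5, 24.8, 26.2],   # continuous outcome
    "obese": [0, 1, 1, 1, 0, 0],                   # binary outcome
})

# Keep the native numeric variable for analysis, but derive a categorical
# version for descriptive tables (recoding loses information).
df["smoking_level"] = pd.cut(
    df["cigarettes_per_week"],
    bins=[-np.inf, 0, 20, 70, np.inf],
    labels=["none", "light", "medium", "heavy"],
)

# The outcome type drives the model family:
#   continuous outcome (bmi)  -> linear regression
#   binary outcome (obese)    -> logistic regression
print(df[["cigarettes_per_week", "smoking_level"]])
```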
Correlations between exposures and phenotypes
Environmental exposures are known to be densely correlated.24,44,45 Correlation patterns depend highly on the context of the analysis. Chemicals released into the environment from a single source or generated by the same biochemical processes (eg, diesel combustion) are correlated and often detected as a cluster, and such clusters have been used as footprints to identify sources of exposure.46,47 Organic chemicals tend to be more highly correlated than water-soluble compounds because of their lipophilicity and accumulation in organisms. Correlations of exposures are also higher between members of the same unit (eg, a household) sharing an environment, and this correlation increases further with longer duration of residence.48,49 One method to intuitively visualize the correlation structure is the correlation globe,50 which can be further developed to show differences between and within sex groups51 (Figure 3). In longitudinal studies, the within-person correlation of repeated measurements of the same exposure is generally higher than the between-person correlation; however, other factors, such as the solubility and exposure trends of the chemicals, could also contribute to this observation.52,53 The consequence is that it is difficult to identify the true contributor(s) to a given outcome in a statistical model. Multicollinearity also makes model parameters and their precision (standard errors) unstable to tiny changes in the input data.54 In ExWAS and exposome research, it is essential to check model assumptions, potentially across multiple exposures. Further, correlation decreases the effective sample size and the statistical power of an analysis.
Figure 3. A correlation globe showing the associations among chemical biomarkers for females, males, and couples. The right half of the globe represents female biomarkers, while the left half represents male biomarkers. Only correlations greater than 0.25 or smaller than −0.25 are displayed as connections. A red line signifies a positive correlation, whereas a dark green line represents a negative correlation. Both color intensity and line width correspond to the magnitude of the correlation. Reproduced from Chung et al.,51 used under CC-BY-NC-ND 4.0 license.
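The correlation structure itself can be summarized directly from the exposure matrix. The following sketch, using simulated biomarker data with hypothetical column names, computes pairwise Spearman correlations and keeps only the moderately strong pairs, mirroring the |r| > 0.25 threshold used in the correlation globe.

```python
import pandas as pd
import numpy as np

# Simulated biomarker matrix: rows are participants, columns are exposures.
# A shared latent factor induces correlation between the columns.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 1))
exposures = pd.DataFrame(np.exp(0.7 * latent + rng.normal(size=(500, 6))),
                         columns=[f"chem_{i}" for i in range(6)])

# Spearman correlations are robust to the skewed, non-normal distributions
# typical of chemical biomarkers.
corr = exposures.corr(method="spearman")

# Keep only the upper triangle and the moderately strong pairs (|r| > 0.25).
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().loc[lambda s: s.abs() > 0.25]
print(pairs.sort_values(key=abs, ascending=False))
```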
The list of potential confounding variables is elusive and may need to be considered in a domain-specific manner
As above, the exposome is densely correlated. A related issue, confounding, is common in exposomic observational studies. Confounding in the context of the exposome refers to a situation where the observed association between an environmental exposure (the exposome factor) and a health outcome is distorted by the presence of another variable, which is related to both the exposure and the outcome but is not an intermediate step in the causal pathway.
Potential confounders can influence both the exposure and the outcome variables and cause spurious associations in analysis; they can be controlled for by including them in a regression model or by stratifying on the hypothetical confounding variables. Nonetheless, a database of confounders is not available for ExWAS associations, and there are several possible reasons for this. First, the phenomenon of confounding may differ from domain to domain, requiring the analytic specification to change for each association. Second, since the exposome is time-varying, there could be many sources of confounding, many of which have not been identified. The exposome includes a vast array of factors across domains. This contrasts with GWAS, where “one” central confounder has been identified, known as population stratification, which describes how genetic variant frequency relates to ancestry. Untangling ancestry versus variant-specific effects is achieved by stratifying the GWAS analysis by ancestral group or by accounting for ancestry in the regression model, and this adjustment does not need to change for each genetic factor.
We give an example: suppose an ExWAS discovers a correlation between a certain environmental chemical and increased rates of a health condition. However, if individuals with the exposure also share a common lifestyle factor (like smoking), which independently increases the risk of the health condition, smoking becomes a confounding variable. It is challenging to determine whether the health condition is due to the chemical, the smoking, or a combination of both. To partially address confounding in exposomic studies, comprehensive data collection is vital. This includes detailed information on a wide range of potential environmental exposures, as well as other demographic, genetic, and lifestyle factors that could influence health outcomes.
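The simulated sketch below illustrates the point: an exposure that is driven by smoking appears associated with disease in a crude model, and the association attenuates once smoking is included as a covariate (all variable names are hypothetical).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated example: smoking raises both the chemical level and disease risk,
# so the crude exposure-disease association is confounded.
rng = np.random.default_rng(1)
n = 5000
smoker = rng.binomial(1, 0.3, n)
log_chemical = 0.8 * smoker + rng.normal(size=n)
p = 1 / (1 + np.exp(-(-2 + 1.2 * smoker)))   # disease depends on smoking only
disease = rng.binomial(1, p)
df = pd.DataFrame({"disease": disease, "smoker": smoker,
                   "log_chemical": log_chemical})

crude = smf.logit("disease ~ log_chemical", data=df).fit(disp=False)
adjusted = smf.logit("disease ~ log_chemical + smoker", data=df).fit(disp=False)

# Crude odds ratio is inflated; the adjusted odds ratio moves back toward the null.
print(np.exp(crude.params["log_chemical"]),
      np.exp(adjusted.params["log_chemical"]))
```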
Sparsity and missingness of exposure data
The size of the chemical space (the set of possible chemical species in the environment) is increasing.55 However, individuals are typically exposed to only a small subset of this universe. If we measure the chemicals in human fluid or environmental samples and tabulate the results, low-dose exposures will predominate, and many values in the data table for a specific chemical will be “missing” in a subset of samples.56-58 This sparsity can be caused by a lack of exposure or by a concentration too low to be detected, that is, left-censored data. For chronic diseases, it is often assumed that the individual effects of many exposures are marginal and that impacts are attributable to the collective actions of a mixture57,59,60; however, deploying modern machine learning techniques to analyze mixtures is severely hampered by non-random patterns of missingness. Left-censoring is an example of data missing not at random and is characterized by missing values that fall below the limit of detection of the measurement. An imputation method, Quantile Regression Imputation of Left-Censored data (QRILC), has been developed to impute such unknown values.61 The method works by sampling randomly from a truncated distribution of values predicted via quantile regression.
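As a rough illustration of the idea behind left-censored imputation (not the QRILC algorithm itself), the sketch below replaces values below a hypothetical limit of detection with draws from a normal distribution truncated above at that limit.

```python
import numpy as np
from scipy import stats

# Illustration of left-censoring: values below the limit of detection (LOD)
# are reported as missing. This is NOT QRILC; it simply draws imputed values
# from a normal distribution truncated above at the LOD, fitted (crudely)
# to the detected part of the data.
rng = np.random.default_rng(2)
true_log_conc = rng.normal(loc=0.0, scale=1.0, size=1000)
lod = -1.0
observed = np.where(true_log_conc >= lod, true_log_conc, np.nan)

detected = observed[~np.isnan(observed)]
mu, sigma = detected.mean(), detected.std(ddof=1)   # biased-high estimates

# Draw imputations from the left tail (values below the LOD).
b = (lod - mu) / sigma
n_missing = int(np.isnan(observed).sum())
imputed = stats.truncnorm.rvs(-np.inf, b, loc=mu, scale=sigma,
                              size=n_missing, random_state=3)

completed = observed.copy()
completed[np.isnan(completed)] = imputed
```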
Data processing
Careful data preprocessing is essential given the multi-modality of exposome data. Preprocessing refers to the various addition, removal, and transformation steps taken to make raw data ready for statistical analysis, and it influences the interpretation and robustness of ExWAS outputs.
Assembling the cohort for analysis
Data for a study are collected as specified by research protocols. Additional filtering steps to select eligible subjects into an analysis are common for large, general-purpose observational studies or when repurposing real-world data (eg, administrative healthcare records). If the analysis requires the integration of multiple independent datasets, variable harmonization is needed to increase the comparability and interpretability of results.62-64 Harmonization generally involves assessing data dictionaries to identify common variables that have different recording formats. New variables are then created by standardizing measurement units and redefining the levels of categorical variables. For instance, in one study, “ethnicity” is a categorical variable with levels “white,” “black,” “Hispanic,” and “others,” whereas in another study the same variable contains “white,” “black,” “Hispanic,” “Asian,” and “others.” For compatibility, some investigators may merge “Asian” into “others” for the new “ethnicity” variable and document how the inconsistency was handled in the data dictionary.
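A minimal sketch of such harmonization, with hypothetical cohorts and category labels, might look as follows.

```python
import pandas as pd

# Hypothetical harmonization: two cohorts record ethnicity with different
# category sets; map both to a common coding and document the decision.
cohort_a = pd.DataFrame({"ethnicity": ["white", "black", "Hispanic", "others"]})
cohort_b = pd.DataFrame({"ethnicity": ["white", "black", "Hispanic", "Asian", "others"]})

harmonized_levels = {"white": "white", "black": "black",
                     "Hispanic": "Hispanic",
                     "Asian": "others",     # merged; recorded in the data dictionary
                     "others": "others"}

for cohort in (cohort_a, cohort_b):
    cohort["ethnicity_harmonized"] = cohort["ethnicity"].map(harmonized_levels)

combined = pd.concat([cohort_a, cohort_b], keys=["study_a", "study_b"])
print(combined)
```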
Data cleaning
When the information is gathered into a structured, tabular format, it is further processed to facilitate downstream statistical analysis. Variables (columns) with a high percentage of missing values, eg, >90%, can be removed to enhance the reliability of the analysis. For extreme values, the data can be trimmed to remove a small percentage of subjects (rows), or values can be replaced with a manually chosen ceiling for the variable (eg, substituting age 99 for an implausible age of 137). However, trimming should be performed with caution to limit the bias introduced into the data.60,65 Datasets with missing values are common across different fields in exposomics, and missingness arises for various reasons, from dropout of subjects in longitudinal analyses to values falling outside the reliable detection range of a chemical measurement. Furthermore, when integrating data from different modalities (eg, different assays), often some assays will be measured and not others for a subset of the participants. Different imputation methods are available, and the choice is typically based on the investigator’s knowledge about the missingness mechanism (eg, missing completely at random, missing at random, censored), the types of analysis and data, and imputation performance. In summary, data cleaning steps may include handling missing values, detecting and managing outliers, removing duplicates, and binning variables. The aim is to enhance the quality of the data, ensuring both internal validity (plausible values and ranges) and external validity (comparable units). This, in turn, enhances the interpretability of downstream statistical analyses.
Transforming variables
Often, data are transformed prior to analysis to adhere to model assumptions and to enhance the interpretation of results. This is particularly important in exposome analyses, where different exposures have different units and differing prevalences of exposure. In a linear model, a predictor (X variable) can be log-transformed to reduce the influence of extremely large values without trimming or substitution, and the same transformation can be applied to a log-normally distributed outcome (Y variable) as a simple way to fulfill the normality assumption of the errors. A “fudge factor,” for example +1, is added to zero values when a log transformation is required.29 For other non-normally distributed variables, the Box-Cox transformation66,67 or an inverse normal transformation68 are options. Since multimodal data can have variables with different units and a large range of absolute magnitudes, z-score standardization69 can be applied to make comparisons possible (eg, each variable is expressed in units of 1 SD of the continuous exposure). In the machine learning context, categorical variables are often required to be one-hot encoded (ie, the levels of a categorical variable are transformed into new individual indicator variables) prior to modeling.
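The sketch below illustrates these common transformations (log with a +1 fudge factor, z-score standardization, and one-hot encoding) on a small hypothetical exposure table.

```python
import numpy as np
import pandas as pd

# Hypothetical exposure table with skewed concentrations, zeros, and a
# categorical variable; names are illustrative only.
df = pd.DataFrame({
    "pcb153": [0.0, 0.4, 1.2, 8.7, 0.9],
    "lead":   [1.1, 2.5, 0.8, 5.3, 3.0],
    "housing": ["owner", "renter", "renter", "owner", "other"],
})

# Log-transform with a "fudge factor" of +1 so zero values remain defined.
df["log_pcb153"] = np.log(df["pcb153"] + 1)

# z-score standardization puts exposures on a common scale (1 unit = 1 SD),
# which makes effect sizes comparable across an ExWAS.
df["lead_z"] = (df["lead"] - df["lead"].mean()) / df["lead"].std(ddof=0)

# One-hot encoding of categorical variables for machine learning models.
df = pd.get_dummies(df, columns=["housing"], prefix="housing")
print(df.head())
```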
Other considerations
Analytic outputs: effect sizes, correlation, and odds ratios
In statistics, an effect size, or association size, quantifies the relationship between two variables. A simple example is correlation, which measures the relatedness of two variables, that is, their tendency to vary together. Typical quantification metrics include Pearson’s product-moment coefficient (r) for linear relationships and, more generally, Spearman’s rank correlation coefficient (rs) for any monotonic relationship. Correlation is a unitless measure with a range of −1 to 1: a negative correlation indicates that the variables change in opposite directions, and a zero correlation means there is no relationship. In ExWAS, absolute correlations are usually low but heterogeneous, with values below 0.25.50,51,70 When comparing the effect sizes of two events, we can use absolute measures, such as the mean risk difference, or relative measures, such as the relative risk and the odds ratio. For relative measures, a value of one indicates that the exposure does not affect the outcome, a value smaller than one indicates a protective effect of the exposure, and a value greater than one indicates a harmful effect. In a linear regression model, a regression coefficient denotes the average change in the dependent variable per unit change in the corresponding independent variable.
Inference versus prediction
A statistical model is built to describe the relationships between the variables of interest and can be used to draw inference or to make predictions. The majority of public health studies focus on collecting representative samples from a population and constructing models in order to draw conclusions to support policy formulation. Conversely, in biomedical studies, models are often used to predict the outcomes of individuals through using their corresponding measurements as predictors. The predicted outcome can further aid in diagnosis and prognosis of diseases.
To build a regression model for inference in an ExWAS, researchers first need to choose a regression model appropriate for a quantitative or binary outcome. Second, the analyst must decide whether to transform the outcome variable so that it adheres to the requirements of the regression (eg, normally distributed errors for continuous outcomes). Then, one needs to identify suitable exposure and outcome variables for the study question and include other potentially confounding variables, based on domain knowledge, to minimize distortion of the association between the variables of interest. For instance, secondhand smoke exposure was adjusted for in analyses of the associations between short-term ozone exposure and platelet activation and blood pressure increases.71 Next, a model is chosen based on the nature of the data and the hypothesis. Before fitting the model, the data must be inspected and cleaned to ensure a valid interpretation of the modeling statistics. The model is usually fitted using ordinary least squares or maximum likelihood estimation. Afterward, it is necessary to check model assumptions, such as normality and homoscedasticity, through diagnostic tests and plots for every exposure-phenotype association.
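A minimal sketch of this inferential ExWAS loop, using simulated data and hypothetical exposure names, is shown below: each exposure is regressed on the outcome with the same adjustment set, and the summary statistics are collected for later multiplicity control.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: three exposures, an age/sex adjustment set, and a binary outcome
# that depends on exp_1 and age only.
rng = np.random.default_rng(4)
n = 2000
df = pd.DataFrame(rng.normal(size=(n, 3)), columns=["exp_1", "exp_2", "exp_3"])
df["age"] = rng.integers(20, 80, n)
df["sex"] = rng.binomial(1, 0.5, n)
p = 1 / (1 + np.exp(-(-3 + 0.5 * df["exp_1"] + 0.03 * df["age"])))
df["outcome"] = rng.binomial(1, p)

results = []
for exposure in ["exp_1", "exp_2", "exp_3"]:
    # Same adjustment set for every exposure; one model per exposure.
    fit = smf.logit(f"outcome ~ {exposure} + age + sex", data=df).fit(disp=False)
    results.append({"exposure": exposure,
                    "odds_ratio": np.exp(fit.params[exposure]),
                    "p_value": fit.pvalues[exposure]})

exwas = pd.DataFrame(results).sort_values("p_value")
print(exwas)
```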
In predictive modeling, data are split into training and test sets, with a ratio between 8:2 and 5:5. Using the training set, variables that strongly influence the outcome are selected as predictors, and fit statistics such as R2 are calculated to assess how well the model describes the data. Such a model can be optimized iteratively, and the performance of candidate models is often evaluated through cross-validation, a resampling procedure that uses different segments of the training dataset for testing and tuning a model across multiple iterations.72 Increasingly, many complex machine learning algorithms are available, and they are often treated as “black boxes” when compared with linear regression models. Most of the time, research questions involve binary classification, and the prediction characteristics are visualized with a receiver operating characteristic (ROC) curve.73 Alternative models can be compared using performance indicators such as the AUC. Model overfitting can occur, and the generalizability of a model’s predictive performance is assessed using the test dataset. The concept of the bias-variance tradeoff emphasizes the importance of finding an optimal balance between simplicity (to prevent overfitting) and accuracy (to effectively capture the underlying patterns in the data) in a machine learning model.74 No simple solution to this problem is known, but techniques such as cross-validation can help detect it early in the model building process.
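A sketch of this predictive workflow, using simulated data in place of a real exposure matrix, is given below; the 8:2 split, cross-validation, and held-out AUC correspond to the steps described above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Simulated data stands in for an exposure matrix X and a binary outcome y.
X, y = make_classification(n_samples=2000, n_features=50, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)   # an 8:2 split

model = LogisticRegression(max_iter=1000)

# Cross-validation on the training set guides model selection and tuning.
cv_auc = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")

# The test set is touched once, at the end, to assess generalizability.
model.fit(X_train, y_train)
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(cv_auc.mean(), test_auc)
```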
Variable selection, reproducibility, and mitigation of the exposome-wide false discovery rate
In exposome research and ExWAS, the analyst attempts to relate a vast array of environmental factors to an outcome, a task known as variable selection. However, the more tests in an ExWAS, the higher the chance that some of the findings (indicating a significant effect) are actually due to random chance. This is known as a false positive or a false discovery. False positives are a threat to the reproducibility of associations. Traditional high-throughput inference techniques include the Bonferroni correction (simultaneous inference) and FDR control (inferring over the average of those that are selected). The former adjusts the p value threshold conservatively, attempting to reduce the probability of selecting even one false positive. The latter is more lenient, aiming to ensure that the expected proportion of false discoveries among all variables selected is controlled.
Practically, these approaches address multiple hypothesis tests by correcting, or adjusting, the significance threshold to account for multiple testing; however, these estimates make inferences without accounting for the other exposome variables that are associated with the outcome. For example, the Bonferroni correction, known as a “family-wise error” rate correction, simply lowers the p value threshold from the standard 0.05 to 0.05 divided by the number of tests, that is, the number of exposome variables or factors being modeled. This leads to a challenging question: how many exposures can be analyzed in an ExWAS, and what should the denominator be?
False discovery rate estimation
The FDR method was introduced by Benjamini and colleagues.21,75 The FDR is essentially the expected proportion of false discoveries among all the discoveries made. For instance, if you perform 100 tests and 10 of them show significant results, the FDR can help estimate how many of those 10 are likely to be false positives. The FDR allows researchers to control the rate of these false discoveries, reducing the likelihood of mistakenly identifying an environmental factor as influential when it is not. Exposome factors being tested are often not independent of each other: for example, exposure to one pollutant might be correlated with exposure to another. Traditional FDR methods assume each test is independent, but this assumption does not hold in many practical scenarios. Recognizing this, more advanced FDR methods have been developed that take into account the correlation between tests. These methods recognize that finding a significant result among correlated tests is different from finding one among independent tests. By factoring in these correlations, they provide a more nuanced and accurate estimation of the FDR.
For instance, if several environmental factors are correlated, a discovery in one may increase the likelihood of a discovery in another. Advanced FDR methods can account for this correlation (eg, the Benjamini-Yekutieli approach76 or an empirical permutation-based approach77), ensuring that the overall rate of false discoveries remains controlled even in the presence of these correlations.
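In practice, these corrections can be applied to the vector of ExWAS p values with standard tools; the sketch below uses the multipletests function from statsmodels on simulated p values to contrast Bonferroni, Benjamini-Hochberg, and Benjamini-Yekutieli adjustments.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Simulated vector of raw ExWAS p values, one per exposure tested.
rng = np.random.default_rng(5)
p_values = np.concatenate([rng.uniform(0, 0.001, 5),   # a few true signals
                           rng.uniform(0, 1, 195)])    # mostly null tests

# Bonferroni controls the family-wise error rate (denominator = number of tests).
reject_bonf, _, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

# Benjamini-Hochberg controls the FDR under independence/positive dependence.
reject_bh, _, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

# Benjamini-Yekutieli is valid under arbitrary dependence between tests,
# which matters because exposome variables are densely correlated.
reject_by, _, _, _ = multipletests(p_values, alpha=0.05, method="fdr_by")

print(reject_bonf.sum(), reject_bh.sum(), reject_by.sum())
```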
The simplest approach to avoid false positives may be validation in a held-out dataset, known as “sample splitting.”78 In this procedure, a large dataset is split into at least two sub-datasets. All variable selection procedures are then executed in one of the datasets, and inference takes place in the second.79 Extensions such as “hierarchical” testing, that is, testing hypotheses at different levels of variables, may be especially appropriate in ExWAS, where exposure groups might be nested by behavior (eg, smoking behavior and biomarkers of smoking).
Variable selection during prediction
A common machine learning task combines a “variable selection” procedure to identify groups of variables that collectively maximize predictive power. Although predictive performance generally increases with the number of model variables, investigators should balance interpretability (simpler models) against the accuracy of the final predictive model. One popular family of procedures in ‘omics research is shrinkage80 (also known as regularization). The LASSO procedure (Least Absolute Shrinkage and Selection Operator)81,82 is an algorithm that “shrinks” some of the less important feature variables, even setting them to zero and essentially removing them from consideration. By focusing only on the most informative features and ignoring the less relevant ones, LASSO prevents the model from getting too attached to noise, irrelevant details, or too many variables in the data. This helps reduce overfitting, ensuring the analysis or model is more robust and generalizes better to new, unseen data. Procedures similar to LASSO (ℓ1 penalty) include ridge (ℓ2 penalty) and elastic net83,84 regression (ℓ1 + ℓ2 penalties). There are also other variants,85-87 such as Group LASSO88 (selecting groups of predictors rather than individual variables), Sparse LASSO (optimized to select a small number of critical variables in high-dimensional data), and Sparse Group LASSO (selecting important groups of variables as well as variables within groups).
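A minimal LASSO sketch on simulated data is shown below; cross-validation chooses the penalty strength, and the nonzero coefficients indicate the retained exposures.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Simulated matrix stands in for standardized exposures; the outcome depends
# on only 10 of the 100 features.
X, y = make_regression(n_samples=1000, n_features=100, n_informative=10,
                       noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)   # penalties assume comparable scales

# Cross-validation selects the penalty strength alpha.
lasso = LassoCV(cv=5, random_state=0).fit(X, y)

selected = np.flatnonzero(lasso.coef_)  # features with nonzero coefficients
print(f"alpha = {lasso.alpha_:.3f}, {selected.size} of {X.shape[1]} features retained")
```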
It is important to emphasize that the ExWAS study design, at best, yields exposures that are reliably correlated with, but not necessarily causal of, a phenotype of interest. Given the dense correlational relationships between exposure factors and phenotypes, as well as often pervasive biases in ascertainment, sampling, and survey weighting, ExWASs are often just the first step in winnowing a long list of exposures for subsequent causal analysis. Typically, after important variables are identified, further studies are conducted to better understand the relationship between the selected variables and the outcome. These include, but are not limited to, meta-analysis for reproducibility of the finding, mediation analysis for mechanistic insights, Mendelian randomization for causal inference, and even molecular experiments to demonstrate the effects of the variables.
Sample versus variable size
In omics studies, a very large number of variables (genes, proteins, metabolites, etc.) are typically related to an outcome. Sample sizes range from hundreds to thousands; only recently have we seen large-scale multi-omics in cohorts such as the UK Biobank. In typical cohort scenarios, however, smaller sample sizes create an analytic scenario known as “large p (number of variables), small n (sample size).” Statistical models built with data whose dimension exceeds the sample size are prone to overfitting (ie, fitting the noise rather than the underlying signals). While such models can perform well on the training data, they have low generalizability and typically fail to reproduce their performance on new data. Other issues include multiple testing (greater type I error rate) and collinearity (high correlation) between variables when selecting impactful variables. Mixture analyses may also be underpowered owing to the small effect sizes of individual exposures. The sample size required for an ExWAS depends on the number and typical effect sizes of the exposures of interest and can be estimated using simulation. For example, in a post hoc power analysis with over 120 endocrine-disrupting chemicals, we estimated that a sample size of ∼2700 is needed to achieve a statistical power of 0.8,89 meaning that if there is a true effect, the test has an 80% probability of detecting it.
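The sketch below illustrates one way to carry out such a simulation-based power calculation, under assumed (hypothetical) values for the sample size, number of exposures, and effect size; in practice the effect size and baseline prevalence would be informed by prior knowledge.

```python
import numpy as np
import statsmodels.api as sm

# Simulation-based power estimate (sketch): what fraction of simulated datasets
# detect a small true effect at a Bonferroni-corrected threshold?
rng = np.random.default_rng(6)
n, n_exposures, true_beta = 2700, 120, 0.15
alpha = 0.05 / n_exposures          # Bonferroni threshold across exposures
n_sim, hits = 200, 0

for _ in range(n_sim):
    x = rng.normal(size=n)
    logit = -1.0 + true_beta * x
    y = rng.binomial(1, 1 / (1 + np.exp(-logit)))
    fit = sm.Logit(y, sm.add_constant(x)).fit(disp=False)
    hits += fit.pvalues[1] < alpha  # did this simulated study detect the effect?

print(f"Estimated power: {hits / n_sim:.2f}")
```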
Conversely, in nationwide-scale analyses, various questionnaire-based and targeted external measurements are integrated through personal identifiers or geocodes, and the sample size could be in the millions or more. Overpowered associations are a major issue in this scenario.90-92 The huge sample size decreases the standard error, amplifying the ability to detect even minuscule differences in effect size. In extreme cases, the interpretation could overemphasize statistically significant effect sizes that lack clear biological relevance or may be “residually” confounded, eg, a 0.001% increase in an outcome per unit change of an exposure. If the dataset is split at random, a given exposure-outcome relationship could be statistically significant in both subsets but with opposite directions of effect, making the results even harder to interpret.
Advanced methods in analyzing the exposome
We have covered basic concepts for exposome study and conducting an ExWAS. However, more advanced topics and methods are available to gain insights on the complex exposure-disease relationships based on the study context and research questions. Many of them are extensions of, or involve the application of, the ExWAS approach and are discussed below.
Methods to incorporate study design features: survey sampling, repeated measures, and time-to-events
Survey sampling
Sampling is the process of selecting a representative subset of individuals from a population in order to obtain population estimates. Commonly used methods include simple random sampling and stratified sampling. In contrast to conducting phone interviews, large-scale nationwide studies involving physical interaction (eg, in-person interviews and examinations) would require significant resources and pose logistical challenges if simple random sampling were used to identify participants. Therefore, complex multistage survey designs are employed. The concept is best illustrated with NHANES, a survey conducted by the Centers for Disease Control and Prevention in 2-year cycles. To provide a representative sample of the US population, a 4-stage survey design is used. First, primary sampling units (PSUs), mostly at the county level, are selected (Stage 1), and segments, generally at the city-block level, within PSUs are subsequently sampled (Stage 2). Households within segments are randomly drawn (Stage 3), and finally individuals are selected at random within households (Stage 4). To obtain correct population statistics, software or statistical packages designed to incorporate the PSU, strata, and survey weight information of the study must be used. An example is the investigation of relationships between 27 physiological markers and mortality across multiple NHANES survey cycles by Nguyen et al.93
Mixed linear modeling to account for repeated measures
Many statistical tests assume independence between observations. Violating this assumption generally shrinks the standard errors and confidence intervals of the estimates, thereby increasing the chance of false findings. In practice, weak and random correlations are almost always present, but the assumption remains largely valid. However, when sources of correlation are known from the study design, they can be incorporated into the model to control for spurious findings. Two typical sources of correlation are clustering of individuals and repeated measurement of the same individuals over time. Clustering occurs when individuals are sampled through specific locations or institutions, such as enrolling students for IQ testing through schools or patients for a disease study through hospitals. These designs have the advantages of reducing the cost of sampling and the variability of the data. Correlated data are analyzed with a mixed effect model94 in which the clustering or repeating unit is modeled as a random effect and the predictors and potential confounders of interest are modeled as fixed effects. Alternatively, a generalized estimating equation approach can be used if only population-averaged effects are of interest.18,95,96
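A minimal mixed-model sketch with simulated clustered data (children nested within schools; all names hypothetical) follows.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated clustered data: a random intercept per school absorbs the
# within-cluster correlation that would otherwise deflate standard errors.
rng = np.random.default_rng(7)
n_schools, per_school = 30, 40
school = np.repeat(np.arange(n_schools), per_school)
school_effect = rng.normal(0, 2, n_schools)[school]   # shared within a school
exposure = rng.normal(size=school.size)
iq = 100 + 1.5 * exposure + school_effect + rng.normal(0, 5, school.size)
df = pd.DataFrame({"iq": iq, "exposure": exposure, "school": school})

# Fixed effect: exposure; random intercept: school.
model = smf.mixedlm("iq ~ exposure", data=df, groups=df["school"]).fit()
print(model.params["exposure"], model.bse["exposure"])
```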
Time to event outcomes
In a longitudinal study, individuals are followed over time and time-to-event data are therefore available; consider, for example, a study of early-life lead exposure and the later development of learning and behavior problems in children. We can conduct survival analysis97 to understand the relationship between an exposure and a delayed outcome. Specifically, a Cox proportional-hazards model can handle multiple predictors and estimate a hazard ratio for each predictor. In addition, because the Cox model is a regression-based method, it fits into the ExWAS analytical framework for conducting exposome-level analysis.
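A sketch of such an analysis, assuming the lifelines Python package is available and using simulated follow-up times with hypothetical variable names, is shown below.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

# Simulated time-to-event data: higher lead exposure shortens time to the event.
rng = np.random.default_rng(8)
n = 1000
lead = rng.lognormal(size=n)
hazard = 0.01 * np.exp(0.3 * np.log(lead + 1))
time = rng.exponential(1 / hazard)
event = (time < 15).astype(int)            # administrative censoring at 15 years
df = pd.DataFrame({"time": np.minimum(time, 15),
                   "event": event,
                   "log_lead": np.log(lead + 1),
                   "age": rng.integers(4, 12, n)})

# Fit a Cox proportional-hazards model; hazard ratios are exp(coef).
cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
print(cph.summary)
```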
Mixture analysis and additive polyexposure scores
Methods introduced earlier for variable selection can also be applied to identify important contributors to an outcome in mixture exposure settings. However, a significant limitation is that only the effects of individual exposures are estimated. This issue becomes more prominent when the concentrations and impacts of individual exposures are believed to be low but collectively they may produce meaningful biological perturbation at the molecular level or even an association with a clinical outcome.87 To address this problem, we can apply weighted quantile sum (WQS) regression.86 The approach creates a summary score of the mixture for each individual and assesses the relationship between the scores and the outcome; the new score, however, can be challenging to interpret. On the other hand, the model also estimates the weights of individual exposures contributing to the score, thus enabling the identification of significant individual contributors to the overall mixture effect. Quantile-based g-computation85 is a technique integrating WQS regression with g-computation. Like the original WQS regression method, it estimates the overall mixture effect, but the parameters are calculated using a marginal structural model instead of standard regression.
Similarly, polyexposure scores (PXS)73,98-101 provide an alternative way to summarize an individual’s exposure risks for a disease, which are typically weak and not statistically significant, into a single predictive index for each subject. Building a PXS involves splitting the original data into three subsets (training, validation, and testing) and employing multiple variable selection steps (eg, ExWAS and LASSO) to identify significant exposure factors for an outcome. Advanced methods are also available for different mixture exposure situations, such as Bayesian kernel machine regression80,102 and boosted regression trees103 for nonlinear response modeling and interaction screening. Identifying the optimal method for detecting health impacts of mixture exposures in the context of exposomics is an active research area, and a wide range of methods has been discussed by others in a consortium setting.83,104-108
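As a highly simplified illustration of the quantile-scoring idea underlying these mixture methods (not the full WQS or quantile g-computation estimators, which additionally estimate component weights), the sketch below recodes each simulated exposure into quartiles, forms an equal-weight index, and regresses the outcome on that index.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated mixture: five exposures with a weak, collective effect on y.
rng = np.random.default_rng(9)
n = 1500
X = pd.DataFrame(rng.lognormal(size=(n, 5)),
                 columns=[f"chem_{i}" for i in range(5)])
y = 0.1 * X.sum(axis=1) + rng.normal(size=n)

# Recode each exposure into quartile scores (0-3) and average them into an
# equal-weight mixture index; the full methods would learn the weights.
quantized = X.apply(lambda col: pd.qcut(col, 4, labels=False))
df = pd.DataFrame({"mixture_index": quantized.mean(axis=1), "y": y})

fit = smf.ols("y ~ mixture_index", data=df).fit()
print(fit.params["mixture_index"], fit.pvalues["mixture_index"])
```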
Going forward: infrastructure to support standardization of ExWAS and terminology
Exposomics, and ExWAS in particular, is a fast-growing field. GWAS were made possible by the standardization of the way genetic factors are digitally represented (eg, as genotypes with standard identifiers) and by the ease with which samples can be collected (eg, in a case-control fashion), owing to the static nature of genetic factors and genotypes and the relative absence of unknown confounding. Standardization of ExWASs will be possible, but it will require advances not only in analytic standards but also in study design characteristics, some of which we have articulated in this paper.
We foresee that a few emerging topics will be crucial for facilitating standardization of ExWAS studies in the future. First, the prevalence of open data and of technologies to access these data, such as cloud computing, will be beneficial. Since 2023, the NIH has required all grant applications to include a data management and sharing plan under a new policy.109 It is becoming standard for study funders and providers to share their data following the FAIR Data Principles (Findability, Accessibility, Interoperability, and Reusability of digital assets).110
Second, administrative healthcare data, such as electronic health records and administrative claims, contain comprehensive, codified, and longitudinal information about an individual’s health and disease status and drug prescriptions. These data have been instrumental in geospatial environmental health studies.9,111-114 Unlike data collected for observational cohorts, these datasets are not created for research.115-117 Data may come from a few major health care centers, and variation due to differing styles of practice has to be considered. In addition, records are triggered by the severity of illnesses: an absence of disease records does not necessarily mean that the patient is disease- or symptom-free. Repurposing these records for research requires that investigators understand the healthcare practice and coding system in order to draw valid conclusions from the analysis.
Third, the functional exposome encompasses the biologically active subset of the exposome.5 Semi-agnostic methods for functional exposomics, which differ from targeted and non-targeted measurement approaches, could become the mainstream driving molecular exposure and multi-omics analysis. These methods include semi-targeted analysis,118-120 suspect screening,121-123 adductomics,124-127 and affinity-based measurement,128 and they are characterized by striking a balance between throughput and interpretability of measurement. An understanding of the data generation processes is essential to ensure correct application of analytical methods and interpretation of results. Furthermore, applying the functional exposome concept to the One Health concept129 could enable holistic, integrated studies of humans, animals, and the external environment.
Finally, the term environment-wide association study (EWAS) was coined by Patel et al. in 2010,16 but variants are common, including environmental-wide association study,130 exposome-wide association study,89 and exposure-wide association study,17 with acronyms such as XWAS, EnWAS, and ExWAS. To further complicate the issue, EWAS is also an acronym for epigenome-wide association study, which appeared in the literature around the same time. This ambiguity makes it challenging to search for relevant studies and creates confusion among researchers. In light of this, we propose to standardize the nomenclature with the term “exposome-wide association study, ExWAS”, pronounced “x-wahz”, for any data-driven study that associates multiple and diverse exposome-based exposures with one or more phenotypes and involves correction for multiple comparisons and assessment of replication. In discipline-specific analyses, such as nutrient-wide and drug-wide association studies, ExWAS could be tagged as a keyword for effective paper retrieval during literature reviews.
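Operationally, the definition above corresponds to a simple analytic loop: one covariate-adjusted model per exposure, followed by correction for multiple comparisons, with replication assessed by repeating the loop in an independent dataset. The sketch below illustrates that loop with hypothetical column names; it is not a prescribed implementation.

```python
# Minimal sketch of the analytic loop implied by the ExWAS definition:
# one covariate-adjusted regression per exposure, then multiple-comparison
# correction. Hypothetical column names; replication would repeat the loop
# in an independent dataset.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.multitest import multipletests

def exwas(df: pd.DataFrame, exposures, outcome: str, covariates) -> pd.DataFrame:
    rows = []
    for exp in exposures:
        model = smf.ols(f"{outcome} ~ {exp} + " + " + ".join(covariates), data=df).fit()
        rows.append({"exposure": exp,
                     "beta": model.params[exp],
                     "pvalue": model.pvalues[exp]})
    results = pd.DataFrame(rows)
    # Benjamini-Hochberg false discovery rate correction across all exposures tested.
    reject, q_values, _, _ = multipletests(results["pvalue"], method="fdr_bh")
    results["q_value"], results["significant"] = q_values, reject
    return results.sort_values("q_value")
```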
Uncovering the contribution of the environment to diseases presents a significant challenge and requires advancements in both measurement technologies and data analytical methods. From a data scientist’s perspective, embracing the use of large cohorts and repurposed datasets, along with the application of the latest analytical techniques, could enable novel discovery of the elusive relationships between the environment and diseases.
Acknowledgments
We thank Heidi Hanson and Ander Wilson for providing comments to improve the manuscript. We thank the participants of the NIEHS Exposome Workshop (https://factor.niehs.nih.gov/2022/9/feature/2-feature-exposomics-research) and the Exposomics Consortium (https://www.exposomicsconsortium.org).
Funding
M.K.C. and C.J.P. were supported by grants from the U.S. National Institutes of Health through the National Institute for Environmental Health Sciences (ES032470, P30ES000002), the National Institute on Aging (AG074372), and from the U.S. National Science Foundation through the Northeast Big Data Innovation Hub. A.M.R., J.S.H., and F.S.A. are supported by intramural funds from the National Institute of Environmental Health Sciences. M.A.L. was supported by grants from the U.S. Environmental Protection Agency (G17D112354237), from the U.S. National Institutes of Health through the National Institute of Diabetes and Digestive and Kidney Diseases (R01DK125586), and from the U.S. Department of Veterans Affairs (HX002680).
Author contributions
Ming Kei Chung (Conceptualization [equal], Project administration [equal], Writing—original draft [equal], Writing—review & editing [equal]), John S. House (Writing—original draft [equal], Writing—review & editing [equal]), Farida S. Akhtari (Writing—original draft [equal], Writing—review & editing [equal]), Konstantinos C. Makris (Conceptualization [equal], Writing—review & editing [equal]), Michael A. Langston (Writing—review & editing [equal]), Khandaker Islam (Writing—review & editing [equal]), Philip Holmes (Writing—review & editing [equal]), Marc Chadeau-Hyam (Conceptualization [equal], Writing—review & editing [equal]), Alex I. Smirnov (Writing—review & editing [equal]), Xiuxia Du (Writing—review & editing [equal]), Anne E. Thessen (Conceptualization [equal], Writing—review & editing [equal]), Yuxia Cui (Writing—review & editing [equal]), Kai Zhang (Conceptualization [equal], Writing—review & editing [equal]), Arjun K. Manrai (Conceptualization [equal], Writing—review & editing [equal]), Alison A. Motsinger-Reif (Conceptualization [equal], Writing—original draft [equal], Writing—review & editing [equal]), Chirag J. Patel (Conceptualization [equal], Supervision [equal], Writing—original draft [equal], Writing—review & editing [equal]), and Members of the Exposomics Consortium* (Conceptualization [equal]).
Data availability
This study does not generate new data or reanalyze any existing datasets.
Conflict of interest statement
The authors declare that they have no conflicts of interest.
References
Author notes
For full consortium author list, please see: https://www.exposomicsconsortium.org/view/EXPOSOME-2023-007