Epigenetic clocks and research implications of the lack of data on whom they have been developed: a review of reported and missing sociodemographic characteristics

Abstract
Epigenetic clocks are increasingly being used as a tool to assess the impact of a wide variety of phenotypes and exposures on healthy ageing, with a recent focus on social determinants of health. However, little attention has been paid to the sociodemographic characteristics of participants on whom these clocks have been based. Participant characteristics are important because sociodemographic and socioeconomic factors are known to be associated with both DNA methylation variation and healthy ageing. It is also well known that machine learning algorithms have the potential to exacerbate health inequities through the use of unrepresentative samples – prediction models may underperform in social groups that were poorly represented in the training data used to construct the model. To address this gap in the literature, we conducted a review of the sociodemographic characteristics of the participants whose data were used to construct 13 commonly used epigenetic clocks. We found that although some of the epigenetic clocks were created utilizing data provided by individuals from different ages, sexes/genders, and racialized groups, sociodemographic characteristics are generally poorly reported. Reported information is limited by inadequate conceptualization of the social dimensions and exposure implications of gender and racialized inequality, and socioeconomic data are infrequently reported. It is important for future work to ensure clear reporting of tangible data on the sociodemographic and socioeconomic characteristics of all the participants in the study to ensure that other researchers can make informed judgements about the appropriateness of the model for their study population.


Introduction
DNA methylation (DNAm) is an epigenetic modification to DNA that is involved in a number of aspects of genome regulation. DNAm is dynamic, changing as we age [1], in the course of disease [2] and in the presence of environmental exposures such as smoking [3] and childhood adversity [4,5]. Ageing is a particularly well-established influence on DNAm, with changes occurring at many DNAm sites across the genome as humans get older [6]. These changes are so reliable that numerous research groups have developed 'epigenetic clocks', mathematical models that use DNAm measurements to predict chronological age and other age-related characteristics [7].
The first generation of epigenetic clock methods (e.g. Horvath and Hannum [1,8]) aimed simply to predict chronological age as accurately as possible. It was observed early on that errors in age prediction, subsequently known as 'age acceleration' or 'age deceleration' (depending on the direction of the error), were associated with a variety of exposures and diseases as well as mortality risk [9]. These associations suggested that, beyond chronological ageing, these clocks may actually provide a measure of biological ageing (which refers to the progressive decline of the body's physiological functions [10]). To enhance this feature, a second generation of epigenetic clocks was developed; these incorporated health biomarkers and phenotypes alongside chronological age in the model. Increasing numbers of studies are now considering how social determinants of health might impact epigenetic ageing, with evidence indicating positive associations between accelerated epigenetic ageing and social adversity measured in relation to socioeconomic position [11], education [12], neighbourhood characteristics [13-15], and racial discrimination [15].
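To make the notion of 'age acceleration' concrete, the following minimal Python sketch shows two ways such a quantity is often computed from clock output: the simple difference between predicted and chronological age, and the residual from regressing predicted age on chronological age. The function names and the toy numbers are illustrative assumptions and are not taken from any of the clock papers reviewed here; individual studies may define the measure differently.

```python
from statistics import mean


def age_acceleration_difference(predicted, chronological):
    """Signed prediction error: positive values suggest 'acceleration',
    negative values 'deceleration' (illustrative definition)."""
    return [p - c for p, c in zip(predicted, chronological)]


def age_acceleration_residual(predicted, chronological):
    """Residuals from an ordinary least-squares fit of predicted age on
    chronological age (an assumed, commonly used alternative definition)."""
    mean_chrono = mean(chronological)
    mean_pred = mean(predicted)
    sxx = sum((c - mean_chrono) ** 2 for c in chronological)
    sxy = sum((c - mean_chrono) * (p - mean_pred)
              for c, p in zip(chronological, predicted))
    slope = sxy / sxx
    intercept = mean_pred - slope * mean_chrono
    return [p - (intercept + slope * c)
            for c, p in zip(chronological, predicted)]


# Hypothetical clock output for three participants (not real data).
predicted_age = [32.0, 41.0, 58.0]
chronological_age = [30.0, 45.0, 55.0]

print(age_acceleration_difference(predicted_age, chronological_age))
print(age_acceleration_residual(predicted_age, chronological_age))
```

The residual form is centred on zero within the sample being analysed, so its values depend on who else is in that sample; this is one reason the characteristics of the study population matter when interpreting it.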
Although the epigenetic clocks themselves have been quite extensively reviewed [7,16-18], relatively little attention has been paid, in both these review articles and the empirical studies that use these clocks, to the sociodemographic characteristics of the individuals whose data have been used to develop these clocks. This sociodemographic information is important in order to assess the generalizability of the clocks to different populations, given robust evidence that (i) variation in DNAm is altered by genetic, social, and environmental factors [3,19-24], (ii) adverse social and biophysical exposures are socially patterned and affect the risk of disease and also processes of healthy ageing [25-27], and (iii) study selection pressures (which may be related to sociodemographic characteristics) can induce biased study estimates [28-31], which could lead to problems with clock generalizability. Our concern is that for these reasons an algorithm developed to estimate ageing using DNAm in one population may not translate to other populations with divergent sociodemographic or socioeconomic profiles. For example, a clock may miss environmental impacts on the methylome that contribute to ageing in some populations if it is developed in a population with low exposure to those impacts, e.g. pollution or social adversity [32]. This is not a new concern, as there is existing literature describing the potential for machine learning algorithms to exacerbate health inequities through the use of unrepresentative samples; this may happen when prediction models underperform in social groups that were poorly represented in the training data used to construct the model [33-35]. If prediction models underperform, then estimates will be less accurate, which could bias associations in either direction. Because of the nature of epigenetic clocks, the direction of bias is likely to be quite complicated.
Epigenetic clocks are made up of many CpG sites (often hundreds) scattered across the genome. The dynamic and regulatory nature of DNAm therefore implies that clocks will integrate biological signals from a wide variety of biological functions and pathways (which for the most part are unknown). The responsive nature of DNAm implies that clocks will capture variation due to a multitude of endogenous biological factors and external exposures that differ between populations. As such, this bias may lead to either over- or underestimation of effects, and we do not think it would be easy to predict which direction the bias might go, even for specific studies.
The possibility of bias has already been described for PhenoAge, one measure of epigenetic age [36]; it is a particular concern for epigenetic data as two recent reviews show that, at least in regard to racialized groups, epigenetic data are predominantly from individuals of 'European' heritage [37,38] (although no such review has been conducted for other sociodemographic characteristics). Widespread inconsistencies in the literature about associations with epigenetic age acceleration support this concern, as inconsistencies are likely driven by differences in characteristics between study populations. For example, inconsistent associations have been reported with education [12,15,39-42], socioeconomic status [15,39,41,43-50], and racialized group [51-54]; as well as inconsistent associations between epigenetic age acceleration and health and social outcomes when analyses are stratified by sociodemographic characteristics, such as education level [55], country of birth [40], and racialized group [42,55,56]. One recent study reported that when testing the association between epigenetic clocks and healthspan-related characteristics, smaller effect sizes were found for Black American participants in comparison to white American participants [54], suggesting it is possible that associations could be biased towards the null in some study populations. These inconsistencies may be due in part to some clocks including loci known to be differentially methylated by country of birth [40] and racialized group [36,45].
Whatever the reason for these inconsistencies, it is clear that any interpretation of associations with epigenetic age acceleration should take into account any differences between the populations in which the epigenetic clock was derived and the populations where the associations were observed. To assist in these comparisons, we collate for the first time the basic social characteristics of the participants utilized in the development of 13 commonly used epigenetic clocks. Table 1 contains extracted information from each of the clock papers. Supplementary Table S1 reproduces Table 1 along with additional details of the conceptualization of racialized groups and genders. Supplementary Table S2 provides details of how each clock was constructed and the specific datasets used by each paper.

Age, Biological Specimen Collection Date, and Birth Cohort
Eleven of the 13 papers report the chronological age of all their training and test dataset participants. Of these, three models included children in their training and test datasets: two included participants from birth to adulthood, with no information as to the biological specimen collection dates or birth cohorts of participants (aged 0-100 years [1]; aged 0-78 years [57]), and one model included participants aged 2-104 years old, with 1/14 cohorts born in 1936 and one in 1921 [58]. Four papers reporting the chronological age of all their participants report using only adult participants (aged between 16 and 101 years) without reporting specimen collection date or birth cohort [8,59-61]. One paper reported using only adult participants (ages 31-82 years) and reported specimen collection date (2000-08) and birth cohort for all of their participants [62]. One paper reported age for all participants (adults, aged 20-91 years) and reported the specimen collection date (1988-2014) and birth cohort (1907-94) for the majority of their participants (4/5 datasets) [63]. Two papers reported using adult participants (aged 21-93 years), with information on specimen collection date (1998-2007) and birth cohort (1906-86) for some of their participants [55,64].
One paper reported age for their training data plus 81% of their test data, plus biological specimen collection date for all participants (1998-2013) and birth cohort for the majority of participants (1917-95) [65]. One paper did not report age but included it as the Y-axis on plots, illustrating that participants were ∼20-90 years old, with no information on biological specimen collection date or birth cohort [66].

Sex or Gender
Biological sex or gender were the most frequently reported characteristics, with seven out of 13 clock papers reporting data for all participants, and three out of 13 reporting partial data. Of the seven reporting full data, four report 'sex' using the biological sex terms 'males' and 'females'. These papers included between 48% and 83% females in their training and test datasets [55,62,64,65]. One paper reports 'gender' using the gendered terms 'women' and 'men', including 52% women [8], and two report using the terms 'female' and 'male' without specifying sex or gender, including between 0% and 58% female participants [1,60]. Of the three studies reporting partial data on sex/gender, one reported on one of eight datasets (100% women, with no use of the terms sex or gender) [61]; one reported on two of five test datasets (46% women, no      [63]; and one paper reports 'gender' using the terms 'female' and 'male', including 81% female participants in the training data, with no reporting for the test datasets [57]. Three studies use 'sex' and one used 'gender' as a covariate in the age estimation model, but none conceptualize the reason for this [8,55,59,62]. None of the papers defined what they meant by the terms sex or gender beyond three studies identifying it as a 'covariate' or 'confounder', so where terminology is mixed it is unclear which specific characteristic is being reported. Being specific about the characteristic that is reported is important because biological sex and gender identity have independent and interacting effects on health [67][68][69][70].

Racialized Groups and Country Context
Only two papers report data on racialized group membership for all the participants in their study, using descriptive terms pertaining to 'race', 'ethnicity', and 'ancestry' (but without defining what these terms mean, whether these data were self-reported by participants or obtained from medical records, and with limited reporting of country context). The Hannum clock [8] was developed using a training dataset including '426 Caucasian and 230 Hispanic individuals', with no country context reported. We note that there is scientific recognition that the origins of the term 'Caucasian' mean its use should be discontinued; this is because the term was derived from a scientifically fallacious typology that presumed humanity originated in the Caucasus and was 'white', with other 'racial' groups framed as 'degenerate' lineages that branched off the 'Caucasian' trunk [71-73]. The paper developing the DNAmTL clock [64] utilized training and test datasets; the training dataset comprised a total of 2256 individuals from two datasets, of whom 81% were categorized as being 'African ancestry' in the text and 'African American' in the tables; and 19% as 'European ancestry' in the text and 'European' in the tables. The country context was provided for one of these datasets (one was based in the USA), so it is not possible to fully contextualize the reported racialized groups based on information reported in the paper. Their first test data comprised 1078 individuals from three datasets (Test and Training 1 comprise unique sets of individuals from two datasets). They report that 86% were categorized as being 'European ancestry' in the text and 'European' in the tables, and 14% as 'African ancestry' in the text and 'African American' in the tables. The country context was provided for 2/3 of these datasets (two were based in the USA), so again it is not possible to fully contextualize the groups.
A second test dataset comprised 9359 individuals from five additional datasets, of whom 77% were categorized as 'European ancestry' in the text and 'European' in the tables, 15% as 'African ancestry' in the text and 'African American' in the tables, and 8% as 'Hispanic ancestry' in the text and 'Hispanic' in the tables. The country context was reported for all five datasets: one in the UK, two in Scotland, one in the USA, and one in Italy.
Five papers that developed an epigenetic clock reported partial data on racialized group membership but did not report clear numbers for all participants in the paper. The paper developing the Horvath clock [1] reported racialized group membership for 5/39 training datasets: participants from four datasets (four brain regions from the same individuals) were categorized as 'non-Hispanic Caucasian ethnicity' (country context not reported), and participants from one dataset were categorized as 'Taiwanese' (country context not reported). Racialized group membership data were reported for 1/32 test datasets, where participants were categorized as 'Gambian' (country context not reported). The paper developing the DunedinPoAm clock [65] reported that 93% of their training dataset participants, all born and residing in New Zealand, were 'white' but did not explicitly report data on racialized groups for the remainder of their training data participants. They state that test dataset participants were 'mostly of white European descent', reporting that 77% of participants in one of the four test datasets were 'white' with no country context; the country context was reported for the other three test datasets with no information on racialized groups (one recruited in the USA and two in the UK). The paper developing the GrimAge clock [55] did not report the racialized group memberships of their US-derived biomarker selection and training dataset (70% of the Framingham Heart Study offspring cohort). Among individuals in their five test datasets, 50% were categorized as 'European ancestry (Caucasians)', 40% as 'African American', and 10% as 'Hispanic'; the country context was not integrated, with 2/5 of the test datasets based in the USA and 1/5 in Italy.
The paper developing the PhenoAge clock [63] did not report the racialized group membership of the participants in whom the biomarkers were selected (the National Health and Nutrition Examination Survey III and IV; both are US national datasets), or in whom the DNAm predictor was trained (InChianti, for which they do not report the country context); they reported the racialized group membership of participants in four of the five test datasets, all of which were based in the USA, comprising 72% of test participants, of whom 18% were categorized as being 'Black', 26% as 'African American', 41% as 'White', and 11% as 'Hispanic'. No data on racialized group membership were provided for the remaining 36% of test participants from two other datasets. Finally, Zhang et al.'s paper developing an elastic net predictor [58] reported data on racialized group membership for one of 14 datasets, where they state that the 695 participants from the motor neuron disease (MND) cohort dataset were 'Chinese' but did not provide further information or country context.
One study reports only the country context of training and test datasets (both based in Germany) [62]. Another reported only country context for 1/2 training datasets and 1/2 test datasets (both based in Germany) [57]. The remaining four studies [59-61,66] report no information about the racialized group membership or country context of their participants. None of the papers that reported data using 'racial' or 'ethnic' categories explicitly explained or justified their conceptualization or usage of racialized groups, beyond three studies conceptualizing use as a 'covariate' or 'confounder'. One study [8] includes 'ethnicity' as a covariate in the age estimation model but does not conceptualize the reasons for this.

Nativity
Only one study explicitly reported the nativity status of their training dataset participants: 100% of the DunedinPoAm [65] training dataset study participants were born in New Zealand (with no report as to the nativity status of the test dataset).

Economic Measures
Only the DunedinPoAm study [65] included data for some participants on economic status in the paper. For the biomarker selection and training dataset, they state that 'The cohort represents the full range of socioeconomic status on NZ's South Island', but they do not provide descriptive statistics that might enable comparison across datasets. For one of their four test datasets (41% of test participants), they provide neighbourhood-level socioeconomic data alongside national statistics, illustrating that the sample is broadly representative of the UK in terms of neighbourhood SES.

Education
Two papers developing epigenetic clocks included data on educational attainment for some of their participants. The first paper, developing the GrimAge clock [55], reported education data for 41.7% of their test dataset participants; they were generally highly educated, with a quarter (24.2%) having a college degree or higher, and only 8.8% having less than high school education (14.8% had a high school degree, and 52% had some college education). The second paper, developing the DNAmTL clock [64], reported education data for 54.9% of their test dataset participants. A higher proportion had less than high school education (16.6%), with the remainder of the sample generally highly educated: 16.7% had a high school degree, 33.6% had some college education, and 33.2% had a college degree or higher. Neither paper reported information on the education level of their training dataset, which is critical to the transportability of the predictor.

Validation across Social Groups
Eight of the 13 papers reporting epigenetic clock methods report some form of validation of their model in their test data stratified by one or more sociodemographic characteristics. One (developing the DNAmTL clock) compared model performance between men and women in the two test datasets, and additionally between racialized groups in one of these test datasets, by correlating the clock estimate with chronological age [64]. In the first test dataset, they found a lower correlation for women even though the training data comprised 75% women. In the second test dataset, they found a higher correlation for women. We suggest this shows the importance of considering multiple participant characteristics when validating predictive algorithms. There was no difference in clock correlation with chronological age between the two racialized groups ('black' and 'white' participants). The paper developing Zhang et al.'s mortality clock [62] stratified downstream analyses by sex/gender; in their test dataset, they found stronger associations in men despite a slightly higher proportion of women in the training data. The paper reporting the Bocklandt clock [60] used 100% 'males' in their training data and tested clock performance in males and females in their test dataset. They found a lower correlation between chronological and predicted age for females compared to males, as well as a higher error between predicted and chronological age for females, suggesting reduced accuracy of the clock. The paper developing the PhenoAge clock [63] looked at the correlation between the age estimator and chronological age stratified by racialized group; they found slightly higher correlations in 'Hispanic' participants, with similar correlations for 'black' and 'white' participants.
The paper developing the GrimAge clock [55] stratified all analyses in the main text by racialized group, conceptualizing this as testing whether the ageing predictor applied to each group; no conclusions from this were presented in the paper, but a supplementary analysis shows higher standard deviations of the age estimator for 'Black' participants and lower standard deviations for 'Hispanic' participants, with 'white' participants having standard deviations in the middle of these groups. However, racialized group membership was not reported for the training dataset participants. Additionally, in the supplement they stratify model predictions by educational attainment and find that the model performed for participants of all education levels (although the lowest hazard ratio was found for less than high school education). One paper compared the difference between predicted and chronological age for participants with differing numbers of years in education, showing no difference at P = 0.05 [57], and presented plots comparing clock performance in 'males' and 'females' but did not provide statistics or elaborate on the comparison in the text. One paper developing an epigenetic clock stated that no difference was observed in clock performance between genders but did not present data [61]. One paper presented plots comparing clock performance in 'males/men' and 'females/women' but did not provide statistics or elaborate on the comparison in the text [65]. One paper compared the ageing rate between men and women but only in the test dataset (which we did not consider to be validation) [8].
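The kind of stratified validation described in this section can be sketched in a few lines of Python. The example below groups participants by a single reported characteristic and summarizes, per group, the Pearson correlation between chronological and predicted age and the median absolute error. The function names, the record layout, and the toy values are hypothetical and chosen only for illustration; none of the reviewed papers is claimed to have used this exact procedure.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, median


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = mean(xs), mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def stratified_performance(records):
    """Summarize clock performance per group.

    `records` is an iterable of (group, chronological_age, predicted_age)
    tuples; the grouping variable is whatever characteristic was reported.
    """
    by_group = defaultdict(list)
    for group, chrono, pred in records:
        by_group[group].append((chrono, pred))

    summary = {}
    for group, pairs in by_group.items():
        chrono = [c for c, _ in pairs]
        pred = [p for _, p in pairs]
        summary[group] = {
            "n": len(pairs),
            "pearson_r": pearson_r(chrono, pred),
            "median_abs_error": median([abs(p - c) for c, p in pairs]),
        }
    return summary


# Toy records for two hypothetical groups (not real data).
toy_records = [
    ("group_a", 30, 31), ("group_a", 50, 51), ("group_a", 70, 71),
    ("group_b", 30, 40), ("group_b", 50, 45), ("group_b", 70, 62),
]

for group, stats in stratified_performance(toy_records).items():
    print(group, stats)
```

Reporting the group sizes alongside each statistic matters, because a correlation estimated in a small stratum is imprecise; stratifying by several characteristics at once would require extending the grouping key to a tuple.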

Discussion
The basic sociodemographic characteristics of participants are generally poorly reported in the 13 papers which developed the most popular epigenetic clocks. This makes it challenging for researchers to judge whether a clock is likely to transport accurately to the population they want to study. If the estimation of epigenetic age is inaccurate in populations with different characteristics, this introduces uncertainty into the relationship between epigenetic age and health and social outcomes, biasing estimates in uncertain directions. This is important because different populations are likely to have different socially patterned social, economic, and biophysical exposures that affect their methylomes, and so clocks developed in a socially homogeneous population may not transport well to a population with different social characteristics and different exposures and experiences.
Chronological age was reported for all study participants in 11 of 13 papers. Of these, three include infants and children in model development, meaning that researchers should be cautious about the application of other models to data from children. The majority of other studies included a wide range of adult ages. However, only six of the papers report the biological specimen collection dates, some with information reported on birth cohort (although birth cohort can be derived easily from age and specimen collection date). Biological specimen collection date and birth cohort are important because they allow researchers to ascertain whether participants lived through events or situations that might have had impacts on health and health equity. To properly analyse issues related to health equity, it is crucial to combine this with data on place and other sociodemographic characteristics. Participants in the six epigenetic clock models where birth cohort could be derived were born from as long ago as 1906 to as recently as 1995, meaning a range of historical and social events (such as war, economic crisis, and changes in social environment) may have been experienced by participants, depending on their other characteristics.
Where gender or sex was reported, with one exception clocks were trained on both male and female/women and men participants; however, gender or sex was not reported for all participants in five of the 13 studies, and none included any discussion of differences between the influences of sex-related biology and societal gender, or their potential interaction. The two clocks reporting data on the racialized group membership of all participants in their training dataset did not provide country context for all of their participants. Each included individuals from two racialized groups: one included participants predominantly categorized as being 'Caucasian' (with no country context) and the other included participants predominantly categorized as being of 'African ancestry' or 'African American' (with at least some of these participants living in the USA). Only two studies (the papers developing DNAmTL and GrimAge) reported the racialized group membership of all participants in their test dataset (this information is not as pertinent as the training dataset); neither fully integrated country context, with one study including persons located in the USA, the UK, Scotland, and Italy, as well as some participants from unspecified countries, who were categorized as being 'European' or 'European ancestry', 'African American' or 'African ancestry', and 'Hispanic' or 'Hispanic ancestry'; the other study included persons living in the USA and Italy, as well as unspecified countries, who were categorized as being 'European ancestry (Caucasians)', 'African American', and 'Hispanic'. Both of these studies included a majority of participants categorized as being 'European' or 'European ancestry', some of whom were located in Europe, some in the USA, and for some participants, there was no country context. Only one paper reported the nativity of their participants alongside the country context.
We note that none of the papers that we reviewed explicitly explained or justified their conceptualization or usage of racialized groups, even when they reported data on these characteristics, utilized them as covariates in the model (in one case), or stratified their analysis according to categories they employed for 'race', 'ethnicity', or 'ancestry'. This mirrors the findings of a recent systematic review examining how a large number of epigenetic studies poorly incorporate data on social groups and social determinants of health [74]. The inclusion of sociodemographic characteristics as features in models such as epigenetic clocks needs to be thoroughly conceptualized and justified because the inclusion of these characteristics could exacerbate inequities by adjusting away inequalities experienced by individuals with these characteristics [75,76], meaning inequalities in biological age would be masked by their inclusion in the clock algorithm.
Crucially, only three of the papers we reviewed presented tangible data pertaining to the socioeconomic circumstances of their participants. Two reported education levels for 41.7% and 54.9% of their test dataset participants, where participants were generally highly educated (with the majority having at least some college education). One reported neighbourhood-level economic data for 41% of test dataset participants alongside national figures, illustrating that the sample was broadly representative of the UK. However, none of these papers reported education or economic data for their training data, which are the most critical data to report. This information is essential to assess the transportability of these clocks to other datasets; it is also important to ensure that health inequalities are not masked or perpetuated in epigenetic research (this may happen when prediction models underperform in social groups that are poorly represented in their training data). The lack of reporting that we find is likely to be due at least in part to the absence of social characteristics in publicly available datasets such as those on GEO; biological data repositories have previously been criticized for a lack of social characteristics of their participants because this prevents the investigation of health inequities that exist between social groups [77]. We would like to reiterate this need for socioeconomic data in the context of epigenetic datasets, as well as the importance of obtaining and reporting these data from cohort studies that have epigenetic data.
Eight of the epigenetic clock papers make efforts to validate their models in participants stratified by one or more sociodemographic characteristics, including sex/gender, racialized groups, and education level. Some suggested that there may be little difference between the groups tested, whereas some suggested lower accuracy in groups dissimilar in some way to the training population. However, we suggest that validation methods ought to be improved beyond simple testing of correlation between the clock model and chronological age, or testing downstream associations, and that papers should consider multiple sociodemographic characteristics in these validation analyses and give them due consideration as an important part of the manuscript. None of the eight papers followed up any differences they found in the discussion, or related differences they found to the characteristics of their training dataset, missing important opportunities to delve into whom these clocks may and may not apply to.
In conclusion, we find that although some of the epigenetic clocks were created utilizing data from datasets including individuals from different sexes/genders and racialized groups, this information is limited by inadequate conceptualization of the social dimensions and exposure implications of gender and racialized inequality, the absence of any socioeconomic data, or any consideration of interactive effects involving these social groups, along with a frequent failure to be clear on the countries from which the data were obtained and also the nativity of the participants. As a result, it is difficult to conclude how transportable the epigenetic clocks with poorly characterized sociodemographic data may be and which social groups they might apply to. Future epigenetic research should report these important participant characteristics, in combination, to contextualize the work; to properly investigate health inequities, we recommend that at a minimum researchers should collect and report both individual-level and structural-level data as one of our authors has previously suggested [78]. Researchers working with existing methods should check (where possible) the characteristics of the participants used to generate the clocks against their own population of study. They should also be mindful of the possibility of inaccurate prediction if the population the clock was developed in does differ (or is unknown), and report this as part of any published work. With the increasing use of epigenetic clocks to conduct work into social determinants of health, an important piece of future work would be to obtain primary study data (where available) to ascertain a more complete picture of the populations in which these epigenetic clocks were developed and what impacts this may have had on the conclusions of subsequent studies using the clocks. This is particularly important to enable studies to address inequalities in health.

Methods
We included the 13 epigenetic clocks discussed in a recent review that either provide the CpG sites used to construct the clock or provide the means to calculate it [7]; this includes all clocks commonly used in the literature. We extracted participant sociodemographic and socioeconomic characteristics, as reported in the original clock papers and all associated Supplementary material. Where applicable, we extracted information separately for training and test data, as the DNAm data in which the clocks were trained are the most pertinent information. For the second-generation clocks, if biomarkers were selected in a separate cohort, we also extracted information about that cohort. We extracted a number of social characteristics of participants that are important for understanding inequalities in health. The participant characteristics we extracted, as characterized in the studies, pertained to age, biological specimen collection date, and birth cohort; sex or gender; racialized groups (including 'race', 'ethnicity', and 'ancestry'); nativity (whether an individual was born in the country of recruitment); country context (the country in which the participants were recruited, identifying the societal structures in which people live); socioeconomic position (e.g. as measured by income, social class); and education; as well as reported validation across social groups. In extracting the data, we use the terms used in the original paper and note where terminology is problematic. Where age and either specimen collection date or birth cohort were reported, we calculated the missing value using the available data.
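The derivation of a missing birth cohort from age and specimen collection date is simple arithmetic, and a minimal Python sketch is given below. The function name, its parameters, and the example values are hypothetical; the exact calculation used for each paper is not spelled out in the text, so this should be read as one plausible way to obtain outer bounds rather than as the procedure actually applied.

```python
def birth_year_bounds(min_age, max_age, first_collection_year, last_collection_year):
    """Approximate outer bounds on birth year for a dataset.

    The oldest participant sampled in the first collection year gives the
    earliest plausible birth year; the youngest participant sampled in the
    last collection year gives the latest. Because ages are reported in whole
    years, each bound is only accurate to within about one year.
    """
    if min_age > max_age:
        raise ValueError("min_age must not exceed max_age")
    if first_collection_year > last_collection_year:
        raise ValueError("first_collection_year must not exceed last_collection_year")
    earliest_birth_year = first_collection_year - max_age
    latest_birth_year = last_collection_year - min_age
    return earliest_birth_year, latest_birth_year


# Hypothetical dataset: participants aged 40-60, sampled between 2000 and 2005.
print(birth_year_bounds(40, 60, 2000, 2005))  # (1940, 1965)
```

These bounds describe the widest possible cohort range; when a paper pools several datasets with different age ranges and collection periods, the calculation would need to be applied per dataset and the results combined.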

Data availability
All data used in this manuscript are available in the original papers that developed the 13 epigenetic clock methods (all references are contained in Supplementary Table S2).

Supplementary data
Supplementary data are available at EnvEpig online.