Exploring the role of genetic confounding in the association between maternal and offspring body mass index: evidence from three birth cohorts.

BACKGROUND
Maternal pre-pregnancy body mass index (BMI) is positively associated with offspring birth weight (BW) and BMI in childhood and adulthood. Each of these associations could be due to causal intrauterine effects, or confounding (genetic or environmental), or some combination of these. Here we estimate the extent to which the association between maternal BMI and offspring body size is explained by offspring genotype, as a first step towards establishing the importance of genetic confounding.


METHODS
We examined the associations of maternal pre-pregnancy BMI with offspring BW and BMI at 1, 5, 10 and 15 years, in three European birth cohorts (n ≤11 498). Bivariate Genomic-relatedness-based Restricted Maximum Likelihood implemented in the GCTA software (GCTA-GREML) was used to estimate the extent to which phenotypic covariance was explained by offspring genotype as captured by common imputed single nucleotide polymorphisms (SNPs). We merged individual participant data from all cohorts, enabling calculation of pooled estimates.


RESULTS
Phenotypic covariance (equivalent here to Pearson's correlation coefficient) between maternal BMI and offspring phenotype was 0.15 [95% confidence interval (CI): 0.13, 0.17] for offspring BW, increasing to 0.29 (95% CI: 0.26, 0.31) for offspring 15 year BMI. Covariance explained by offspring genotype was negligible for BW [-0.04 (95% CI: -0.09, 0.01)], but increased to 0.12 (95% CI: 0.04, 0.21) at 15 years, which is equivalent to 43% (95% CI: 15%, 72%) of the phenotypic covariance. Sensitivity analyses using weight, BMI and ponderal index as the offspring phenotype at all ages showed similar results.


CONCLUSIONS
Offspring genotype explains a substantial fraction of the covariance between maternal BMI and offspring adolescent BMI. This is consistent with a potentially important role for genetic confounding as a driver of the maternal BMI-offspring BMI association.

Note S1: Published phenotypic associations between maternal pre-pregnancy adiposity and offspring adiposity from childhood to adulthood We conducted a literature search with the aim of identifying studies that examined the association between maternal pre-pregnancy adiposity and offspring adiposity in childhood or adulthood, and that presented results both unadjusted and adjusted for potential confounders. We used the following Medline search terms: 1. maternal 2. AND pregnancy 3. AND BMI OR body mass index 4. AND offspring OR child We did not limit results by year of publication. The search returned 2693 results, which we assessed against the following inclusion and exclusion criteria:  Studies which presented a regression coefficient for the association between a continuous measure of maternal adiposity (pre-pregnancy or during pregnancy) and a continuous measure of offspring adiposity (at age > 1 year), both adjusted and unadjusted for potential confounders Exclusion criteria  Studies which adjusted for offspring body size variables (e.g. birth weight, childhood body mass index (BMI)), which are potential mediators  Studies which only presented results from analysis treating exposure and/or outcome as categorical variables  Studies for which the outcome was change in adiposity (as opposed to absolute adiposity level) After screening titles and abstracts against the criteria above, 64 studies remained. After assessing the full text of articles, 14 studies remained, and are presented in Table S1 below.
Based on triangulated evidence from several study designs it is likely that confounding is an important driver of the maternal adiposity-offspring child/adolescent adiposity association (Main Text). The studies that we have identified here carried out adjustment for a variety of potential confounders, and in almost all studies this adjustment had a negligible impact on the magnitude of the association. There are two possible explanations for this finding: first, that confounders were measured poorly, and second, that other unmeasured confounders (for example genotype) have an important effect.   (19).

Note S3: Sample derivation flow charts
Flow charts are presented below for each cohort, and for the combined cohorts ( Figure S4). For the combined cohorts the final sample size for each phenotype is not equal to the sum of the equivalent sample sizes for the individual cohorts, because the relatedness exclusion filter resulted in removal of different numbers of participants when applied to the combined cohorts as opposed to the separate cohorts. GCTA-GREML (Genomic-relatedness-based Restricted Maximum Likelihood implemented in the GCTA software) requires that cryptic (unknown) relatedness be removed to avoid confounding due to familial environment and non-additive genetic effects (20). Sensitivity analyses demonstrated that results of our primary analyses were similar at a broad range of relatedness thresholds (Notes /Tables/Figures S23-S27), and we found that in the NFBCs using a threshold below 0.05 resulted in exclusion of a large proportion of the sample (Table S50), since Finland has a relatively small founding population. Consequently, after merging data from the three cohorts we removed one individual from each cryptically related pair, using a relatedness threshold of 0.05. Genotyping was carried out at the Broad Institute using the Illumina Infinium 370cnvDuo array and the Beadstudio calling algorithm, as described previously by Sabatti et al. (21). Individuals were excluded due to call rate < 95%, unspecified sex, sample duplication/contamination, sex mismatch, relatedness (identity by descent [IBD]), outlying heterozygosity or withdrawal of consent, giving a sample size of 5400. Population stratification was assessed by multidimensional scaling analysis (MDS) and compared with Hapmap phase 3 reference populations; no individuals of non-European ancestry were detected. Copy number variations (CNVs) and single nucleotide polymorphisms (SNPs) with call rate < 95% (for markers with minor allele frequency [MAF] > 5%), CNVs and SNPs with call rate < 99% (for markers with MAF <5%), lack of Hardy-Weinberg equilibrium (HWE) (P < 1.0 x 10 -4 ) or MAF < 1% were excluded.

NFBC1986
Genotyping was carried out at Imperial College London and the University of Oxford using the Illumina HumanOmniExpressExome-8v1.2 array and the GenomeStudio calling algorithm. Individuals were excluded due to call rate < 99%, sex mismatch, relatedness (reported and identity by state [IBS]), outlying heterozygosity and non-random sampling, giving a sample size of 3384. Population stratification was assessed by MDS and compared with Hapmap phase 3 reference populations; no individuals of non-European ancestry were detected. SNPs with call rate < 99%, lack of HWE (P < 1.0 x 10 -4 ) or MAF < 1% were excluded.

ALSPAC
Offspring genotyping was carried out at the Wellcome Trust Sanger Institute, Cambridge, UK and the Laboratory Corporation of America, Burlington, NC, USA using the Illumina HumanHap550 quad chip array. Individuals were excluded on the basis of gender mismatches, minimal or excessive heterozygosity, disproportionate levels of individual missingness (> 3%), and insufficient sample replication (IBD < 0.8). Population stratification was assessed by MDS and compared with Hapmap phase 2 reference populations; all individuals with non-European ancestry were removed, giving a sample size of 9115 (prior to application of the other exclusion criteria for the present study). SNPs with call rate of < 95%, lack of HWE (P < 5.5 x 10 -7 ) or MAF < 1% were excluded.
Note S6: Selection of offspring anthropometric measurements within age windows Offspring anthropometric measurements during childhood and adolescence were used at birth and at four age windows throughout childhood and adolescence, giving mean ages close to 1, 5, 10 and 15 years (target ages) ( Table S7, Table 1). Age windows were chosen to maximise sample size and obtain measurements with a similar age distribution for all three cohorts; within windows data points were first selected to prioritise higher quality data sources (clinical exam > growth record > questionnaire) and second to minimise the age difference from the aforementioned target ages. The source of the measurements used at each age for each cohort is given in Table S8.   Note S9: Equations for phenotypic variable standardisation and phenotypic covariance, and statistical details of the GCTA-GREML model

Phenotypic variable standardisation
We standardised phenotypic variables using the following formula: = ( − ̅ )/ , where denotes the value of the standardised variable for the individual, denotes the unstandardised variable, and ̅ and denote the mean and standard deviation respectively of the unstandardised variable in the combined cohorts.

Phenotypic covariance
Phenotypic covariance between maternal BMI and offspring phenotype was calculated using the usual formula: where and are the values of maternal BMI and offspring phenotype respectively for the mother-child dyad, and ̅ and are the corresponding means.

GCTA-GREML
For our primary analyses we used the bivariate GCTA-GREML model to partition the phenotypic covariance between maternal BMI and offspring phenotype (CovP) into that explained by additive genetic effects captured by offspring SNPs (referred to as the genetic covariance [CovG]) and residual (unexplained) covariance (CovE). We then calculated the ratio CovG:CovP, which has been referred to as the bivariate heritability (22). For clarity however it is helpful to first outline the univariate GCTA-GREML model, which partitions the phenotypic variance of a single trait (VarP) into that explained by additive genetic effects captured by SNPs (referred to as the genetic variance [VarG]), and residual variance (VarE), in order to estimate the SNPbased heritability of the phenotype (VarG/VarP). Specifically, this is the narrow-sense heritability (denoted h 2 ), because the numerator includes only the phenotypic variance explained by additive genetic effects. Note S13 and Table S14 enable comparison of our primary bivariate GCTA-GREML results to univariate GCTA-GREML results from the same sample.
Univariate and bivariate GCTA-GREML have been described in full in the original methods papers (23,24); here we adapt the notation and description of the models from Deary et al. (25).
In univariate GCTA-GREML the variance of the phenotype is partitioned using a linear mixed model: where is the phenotypic value, is the intercept, and are the coefficient and indicator variable respectively for a covariate modelled as a fixed effect, is the additive genetic value (the sum of the additive effects of all SNPs) for the individual, modelled as a random effect with ~(0, ) and is the residual with ~(0, ). Under this model the phenotypic covariance between pairs of distantly related individuals is modelled as a function of their genetic similarity: ( where is the genetic relationship between individuals and estimated from the SNP data.
Equation 1 can be re-written in matrix form as: where = { } × is a vector of phenotypic values with being the sample size, = { } ( )× is a vector of fixed effects (with being the number of fixed effects), = { } ×( ) is an incidence matrix for the fixed effects, = { } × is a vector of additive genetic values and = { } × is a vector of residuals. The variance-covariance matrix is then: where = { } × is the genetic relatedness matrix (GRM) and is an × identity matrix. It has been shown previously (23) that the model specified by Equation 4 is equivalent to a model fitting the effects of all the SNPs together: where = { } × is an incidence matrix for the SNPs; with denoting the genotype for the individual at the SNP, coded 0, 1 or 2 corresponding to genotypes bb, Bb and BB, and denotes the allele frequency of the B allele at locus ; is an × 1 vector of SNP effects, ~( , ). These SNP effects are assumed to be uncorrelated because the model is equivalent to fitting all SNPs jointly (20). Equation 8 standardises the genotypes such that they have mean zero and unit variance. We define The mean (GRM off-diagonals) is zero, and in the absence of inbreeding the mean (GRM diagonals) is unity. Since the additive genetic value is the sum of the SNP effects, The variance-covariance matrix will then be: We fitted this model using the GCTA software, which implements the average information (AI) algorithm (26) to give restricted maximum likelihood (REML) estimates of and .
GCTA-GREML can be extended to a bivariate model (24) to partition the covariance of two phenotypes: = + + for trait one (maternal BMI), and Equation 12 = + + for trait two (offspring phenotype), Equation 13 with ~( , ), ~( , ), the distributions of and specified similarly, and notation as for Equation 5 above with the subscripts "1" and "2" denoting the two traits. The phenotypic covariance between pairs of distantly related individuals for trait one, trait two and between traits one and two is then modelled as a function of their genetic similarity: where and denote the genetic and residual covariance between the two traits respectively. In matrix notation the variance-covariance matrix can be written as: We fitted this bivariate model using the GCTA software (26), to obtain REML estimates of variance and covariance components.
Note S10: Sensitivity analysis 1-Phenotypic variable transformations We investigated whether CovG:CovP estimates were sensitive to the transformation used to normalise skewed variables (all phenotypic variables aside from birth weight (BW) were positively skewed). Four transformations were investigated: 1. Standardisation only (i.e. no normalising transformation) 2. Natural log followed by standardisation 3. Rank based inverse normal transformation 4. Z-score calculated using the UK-WHO Child Growth Reference (for weight at birth [standardised for sex and gestational age], and for BMI from age 1 to 15 years [standardised for sex and age at BMI measurement]) Figure S11 shows that normalising transformations made little difference to the CovG:CovP estimates; we have therefore presented the analyses with no normalising transformation as the primary results. Note S12: Source of other variables As part of Sensitivity analysis 3 (Notes/Tables/ Figures S20-S24) we included potential confounders of the maternal BMI-offspring BMI association as fixed effects whilst fitting the bivariate GCTA-GREML model. Details of the variables used in this analysis are given immediately below.

NFBC1966
Questionnaires were filled in by mothers during pregnancy between 24 and 28 weeks gestation and trained midwives recorded prospective data on the pregnancies on a structured study form from the 16 th gestational week onwards using the maternity records. Gestational age and maternal age at delivery were from the maternity records; gestational age was calculated based on the last menstrual period. Maternal age at delivery was calculated from offspring date of delivery and maternal date of birth. Maternal parity, smoking and socioeconomic position (SEP) variables were derived using data from the pregnancy questionnaire; parity was defined as the number of previous pregnancies resulting in live or stillbirth. Using data on smoking behaviour in the second month of pregnancy a maternal smoking variable was derived with the following categories: non-smoking, light to moderate smoking (10 or less cigarettes per day) or heavy smoking (greater than 10 cigarettes per day). Socioeconomic position (SEP) in pregnancy was defined based on paternal occupation and grouped into the following categories: class I (professionals), class II (professionals, lower), class III (skilled workers), class IV (unskilled workers and no occupation), farmers with land >= 8 hectares and farmers with land < 8 hectares.

NFBC1986
Questionnaires were filled in by the mothers during early pregnancy (at 12-16 weeks gestation) and late pregnancy (from 24 weeks gestation onwards). Gestational age was recorded on a structured study form by the midwives attending the birth, and was calculated from early pregnancy ultrasound scan or the date of the last menstrual period. Maternal age at delivery was calculated from maternal date of birth and offspring date of delivery; maternal parity data were from the early pregnancy questionnaire. Maternal SEP data were from the late pregnancy questionnaire; SEP during pregnancy was defined based on paternal occupation and grouped into the following categories: professional/entrepreneur, skilled worker (non-manual), skilled worker (manual), unskilled worker/apprentice, farmer, student/at home and sick pension/unemployed. Using questionnaire data on smoking behaviour after the second month of pregnancy a maternal smoking variable was derived with the following categories: non-smoking, light to moderate smoking (10 or less cigarettes per day) or heavy smoking (greater than 10 cigarettes per day).

ALSPAC
Parity was reported by the mothers in a questionnaire completed at 18 weeks gestation. SEP during pregnancy was derived from data on maternal and paternal occupation reported by the mother in a questionnaire completed at 32 weeks gestation; the highest occupational group of the mother or father was used to give a variable with the following categories: class I (professional occupations), class II (managerial and technical occupations), class III (skilled manual occupations), class IV (partly skilled occupations) and class V (unskilled occupations). Data on maternal smoking during pregnancy were from questionnaires completed by the mother at 18 and 32 weeks gestation; a variable was created with three categories: never smoked during pregnancy, smoked in early pregnancy only and smoked throughout pregnancy. Maternal age at delivery was calculated from maternal date of birth and offspring date of delivery. Data on gestational age at delivery were obtained from birth records, and consistent with UK clinical practice at the time will have largely been based on the date of the mothers' last menstrual period, with some potential modification in those who had an ultrasound scan in the first trimester or on clinical assessment at birth.
Note S13: Univariate GCTA-GREML estimates of SNP heritability For our primary analyses, we used bivariate GCTA-GREML to estimate SNP-based bivariate heritability (CovG:CovP) for maternal BMI and offspring phenotypes, as explained in the Main Text and Note S9. It is also possible to fit a univariate GCTA-GREML model, in order to estimate SNP-based univariate heritability (i.e. VarG/VarP, where VarG is the phenotypic variance explained by the additive genetic effects captured by imputed SNPs, and VarP is the phenotypic variance). Estimates of SNP-based univariate heritability (denoted ℎ ) are presented below for the same offspring phenotypes that we investigated in our primary analyses. Comparison of the estimates from Table S14 with our main results (Figure 2 in Main Text) illustrates the difference between bivariate SNP heritability and univariate SNP heritability. If univariate SNP heritability for maternal and offspring phenotypes is substantial, and the maternal and offspring phenotypes are genetically correlated (Note S15), this would suggest that the covariance between maternal and offspring phenotypes is due in part to genetic effects. However, our bivariate SNP heritability results go further than this by empirically quantifying the extent to which the maternal-offspring phenotypic covariance is explained by genetic effects captured by SNPs; they also enable comparison of bivariate heritability for different offspring phenotypes (e.g. birth weight versus adolescent BMI). Table S14: Univariate GCTA-GREML estimates of , from the combined cohorts. 95% confidence intervals were calculated as the point estimate +/-1.96 x SEGCTA, where SEGCTA denotes the standard error as supplied by the GCTA software. Phenotypes were treated as for the primary bivariate GCTA-GREML analyses, 20 principal components, cohort, offspring sex and age at phenotype measurement (replaced with gestational age at birth for BW models) were included as fixed effects, a relatedness exclusion threshold of 0.05 was applied, and the "--reml-no-constrain" option was used Note S15: Relationship of bivariate heritability to genetic correlation Genetic correlation (rG) is defined as the correlation between the additive genetic values for two traits (with additive genetic value defined as the sum the additive effects of an individual's alleles). rG is a measure of the extent to which the additive genetic effects are shared by the two phenotypes (with the caveat that rG may also be due to assortative mating or indirect genetic effects), and is calculated by the equation: where and are the parts of the phenotypic covariance and variance respectively which are explained by additive genetic effects, and the subscripts "1" and "2" denote phenotypes one and two.
Bivariate heritability (CovG:CovP) is a measure of the extent to which the total phenotypic covariance is explained by additive genetic effects, and is a function of rG, the genetic variance of both traits, and the phenotypic covariance.
Note S16: Sensitivity analysis 2-SNPs used to calculate the genetic relatedness matrix For our primary analyses we used SNPs with minor allele frequency (MAF) > 1%, imputation r 2 > 0.3 and Hardy-Weinberg equilibrium (HWE) P value > 1e-6. We investigated whether SNP heritability (i.e. the proportion of phenotypic variance explained by additive genetic effects captured by genome-wide imputed SNPs [ℎ ]) and CovG:CovP estimates were sensitive to the MAF and imputation r 2 thresholds applied to select the SNPs from which the genetic relatedness matrix (GRM) was calculated. Four sets of SNPs were used (for all sets a HWE P value threshold of 1e-6 was also applied): 1. Common imputed SNPs: MAF > 1%, imputation r 2 > 0. 3 2. High quality imputed SNPs: minor allele count (MAC) > 5, imputation r 2 > 0.99 3. Array SNPs: only SNPs that were directly genotyped on the genome-wide microarray 4. All imputed SNPs with MAC > 5, imputation r 2 > 0. 3 We fitted models separately for the individual cohorts and compared weighted mean estimates across all cohorts and phenotypes (weighted by sample size). Table S17 and Figures S18 and S19 show that mean estimated ℎ was somewhat higher when using all SNPs with MAC > 5 and r 2 > 0.3 (consistent with previously published findings (27)), and somewhat lower when using high quality imputed SNPs, than when using common imputed SNPs or array SNPs. However, in bivariate analyses CovG:CovP estimates did not differ markedly when using high quality imputed SNPs versus common imputed SNPs, although CovG:CovP estimates were somewhat higher when using all SNPs with MAC > 5 and r 2 > 0.3. These results provide reassurance that the estimates from our primary analyses (using SNPs with MAF > 1% and r 2 > 0.3) would not have been markedly different had we used only high quality imputed SNPs, and are conservative relative to estimates using all SNPs with MAC > 5 and r 2 > 0.3.  Figure S18: SNP heritability estimates at varying SNP minor allele frequency and imputation quality thresholds. Black crosses represent the weighted mean of the estimates across cohorts and phenotypes (weighted by sample size) Figure S19: CovG:CovP estimates for maternal BMI and offspring phenotype, at varying SNP minor allele frequency and imputation quality thresholds. Black crosses represent the weighted mean (weighted by sample size) of the estimates across cohorts and phenotypes, black stars represent the weighted mean of the absolute values of the estimates across cohorts and phenotypes. The absolute value of a number is its non-negative value without regard to its sign

Note S20: Sensitivity analysis 3-Number of principal components and other covariates
We investigated whether ℎ and CovG:CovP estimates varied as the number of ancestry informative principal components (PCs) that were fitted as fixed effects was varied between zero and one thousand. Figures/Tables S21-23 show that there was a slight upward trend in ℎ estimates as the number of PCs was increased from zero to one hundred, beyond which estimates became somewhat unstable. A gradual downward trend in ℎ estimates as more PCs are fitted might indicate inflation of ℎ estimates due to residual population stratification uncontrolled by PCs (28). We saw no such downward trend, providing reassurance that our results are not substantively biased by population stratification.
For the pooled IPD estimates we also investigated whether including different covariates as fixed effects impacted our ℎ and CovG:CovP estimates. We fitted three models:    Figure S23: CovG:CovP estimates for maternal BMI and offspring phenotype, with varying numbers of principal components fitted as fixed effects, in the combined cohorts (pooled IPD estimates). Black crosses represent the weighted mean (weighted by sample size) of the estimates across phenotypes, black stars represent the weighted mean of the absolute values of the estimates across phenotypes. The absolute value of a number is its non-negative value without regard to its sign

Note S25: Sensitivity analysis 4-Relatedness exclusion threshold
We investigated whether estimates of ℎ and CovG:CovP were sensitive to the relatedness threshold used to exclude cryptically related individuals from the sample. If estimates increase on relaxation of the relatedness threshold this could suggest that estimates at less stringent thresholds are biased upwards due to confounding of the familial environment with genetic similarity. Figure S26 and Table S27 show that there was a slight upward trend in ℎ estimates as the relatedness threshold was relaxed from 0.025 towards 0.2 (i.e. to allow more closely related individuals to remain in the sample). Figure S28 and Table S29 show that the weighted mean of the CovG:CovP estimates was somewhat lower at relatedness threshold 0.025 compared to less stringent thresholds, and that this was partly driven by more extreme negative CovG:CovP estimates, such that the weighted mean of the absolute values of CovG:CovP was similar at all thresholds. We conclude that 1. It is possible that there is a small amount of inflation of our CovG:CovP and ℎ estimates due to cryptic relatedness 2. Because of the markedly lower sample size at relatedness threshold 0.025 it is difficult to distinguish any true inflation effects from chance differences due to sampling error. The absence of a smooth upward trend in estimates on relaxation of the relatedness threshold for most phenotypes suggests that sampling error is an important cause of differences in estimates between relatedness thresholds 3. Taken together with the results of a test which uses genome partitioning and showed no evidence of ℎ inflation due to cryptic relatedness or population stratification at relatedness threshold 0.05 (Note S34 and Table S35), these results provide reassurance that any inflation of our estimates due to cryptic relatedness, if present, is likely to be minor. Figure S26: SNP heritability estimates at varying relatedness thresholds for all phenotypes, in the combined cohorts (pooled IPD estimates). Black crosses represent the weighted mean of the estimates across phenotypes (weighted by sample size).  Figure S28: CovG:CovP estimates for maternal BMI and offspring phenotype, at varying relatedness thresholds, in the combined cohorts (pooled IPD estimates). Black crosses represent the weighted mean (weighted by sample size) of the estimates across phenotypes, black stars represent the weighted mean of the absolute values of the estimates across phenotypes. The absolute value of a number is its nonnegative value without regard to its sign  Figures S31-S33 show that there were not large differences in CovG:CovP for weight, BMI and PI. Importantly, our main conclusions would have changed little had we used BMI or PI as the primary phenotype at birth instead of BW. Figure S31: Partitioned covariance between maternal BMI and offspring weight at all ages in the combined cohorts (pooled IPD estimates). W: weight at the indicated age (years) Figure S32: Partitioned covariance between maternal BMI and offspring BMI at all ages in the combined cohorts (pooled IPD estimates). BMI: body mass index at the indicated age (years) Figure S33: Partitioned covariance between maternal BMI and offspring ponderal index at all ages in the combined cohorts (pooled IPD estimates). PI: ponderal index at the indicated age (years) Note S34: Testing for inflation of heritability estimates due to population structure We partitioned ℎ by chromosome (29), as proposed by Yang et al. (30), to test for inflation of ℎ estimates due to population structure (cryptic relatedness and population stratification). In brief, this involves splitting the genome into disjoint portions (in our case approximate halves [chromosomes 1-7 and 8-22]) and estimating ℎ from each of these disjoint halves separately (h 2 CHR 1-7 and h 2 CHR 8-22). Inflation of ℎ estimates due to population structure is estimated by the difference between the sum of ℎ estimates from the disjoint halves and ℎ estimated from the whole genome by fitting both halves jointly (h 2 JOINT). Table S35 shows that estimated ℎ due to population structure was negligible for the combined cohorts (pooled IPD estimates), providing reassurance that our results are unlikely to be substantively biased by cryptic relatedness or population stratification. Note S36: Leave-one-out jackknife procedure We used a leave-one-out jackknife procedure (31,32) to calculate standard errors (SEs) and 95% confidence intervals (CIs) for CovG, CovP and CovG:CovP because to our knowledge, no formula is available for the SE of CovG:CovP, and non-parametric bootstrapping is not possible in this setting a . We first calculated N estimates of the parameter of interest (CovG, CovP or CovG:CovP) for each of the samples of size N -1 obtained by omitting the N th observation; these are referred to as partial estimates, denoted . We then calculated pseudovalues, denoted * , from each of the partial estimates as * = − ( − 1) . We calculated the jackknife SE, denoted SEJACK, as the standard deviation of the pseudovalues: a Non-parametric bootstrapping involves sampling with replacement, which is not possible for GCTA-GREML analyses since individuals present more than once in a bootstrap sample would be genetically equivalent to monozygotic twins, therefore would be removed on application of a relatedness exclusion threshold to the GRM.
where • * is the mean of the pseudovalues. Finally, we calculated 95% CIs as the point estimate from the entire sample ± 1.96 * SEJACK.

Note S37: Validation of jackknife procedure via simulation
We conducted a simulation to establish that confidence intervals (CIs) for a ratio of covariances calculated using a leave-one-out jackknife approach are likely to have good coverage properties in our case. The simulation involved four steps: 1. Simulation of a population of ten million mother-offspring dyads. Each dyad had two normally distributed genetic values and two normally distributed environmental values (i.e. a maternal genetic value, a maternal environmental value, an offspring genetic value and an offspring environmental value). We simulated this random vector of genetic and environmental values = [ , , , ] from a multivariate normal distribution: , and , denote genetic and environmental values for the mother and offspring respectively and variance and covariance values were chosen to give values of CovP, CovG:CovP and ℎ similar to our empirical results (Table S38). Two phenotypic values (one maternal value, one offspring value) were calculated for each dyad as the sum of the relevant maternal or offspring genetic and environmental values. Four sets of simulated parameters were used (Table S38) 2. Selection at random of m = 5000 samples, of size n = 250 b , from the simulated population 3. Calculation of CovG, CovP and CovG:CovP for each sample (using Equation 1 in Note S9), along with 95% CIs for each parameter using a leave-one-out jackknife procedure (Note S36) 4. Calculation of CI coverage as the proportion of the m = 5000 confidence intervals that included the true population value Simulation results (Table S38) suggested that confidence intervals from the leave-one-out jackknife have good coverage properties, including for a ratio of covariances such as CovG:CovP. In the simulation we also compared CI coverage to that from a non-parametric bootstrap approach (bias corrected and accelerated [BCa] bootstrap), and found that CI coverage was extremely similar for the jackknife and bootstrap (data available on request). We were able to use the bootstrap approach for our analyses of simulated data, but not for our b Sample size was chosen in order to give standard errors for CovG of a similar magnitude to those from our primary GCTA-GREML analyses.
primary empirical analyses, because in the simulation we used Equation 1 (Note S9) to calculate the genetic covariance, whereas in the empirical analyses we used GCTA-GREML.

Note S41: Standard meta-analysis results
For the primary analyses we merged individual participant data (IPD) from the three cohorts and fitted the GCTA-GREML model on the pooled dataset. In the meta-analysis literature this is referred to as one-stage IPD meta-analysis (33), and has also been referred to as megaanalysis, however for simplicity we use the term "pooled IPD estimates" here. We standardised phenotypic variables in the combined cohorts after merging. These pooled IPD estimates had greater statistical efficiency than a standard meta-analysis in which the GCTA-GREML model is fitted separately for each cohort, and the pooled effect is then estimated using a fixed or random effects model, but assumed that the three cohorts were from the same population. As a sensitivity analysis we therefore conducted a standard meta-analysis: we calculated estimates within each cohort separately (having previously carried out standardisation of phenotypic variables within separate cohorts), followed by estimation of the pooled effect using both fixed (inverse variance weighted) and random effects (DerSimonian and Laird (34)) models, implemented with the metan command (35) in Stata. These standard meta-analyses tested whether results were sensitive to (1) standardisation of phenotypes within cohorts separately, and (2) for the random effects model, relaxation of the assumption that the three cohorts are from the same population. Table S42 shows heterogeneity statistics from the standard meta-analysis. There was marked heterogeneity for CovP at most ages, which may have been due to differences in the strength of the obesogenic environment between cohorts, but less evidence of heterogeneity for CovG and CovG:CovP, particularly at birth, 10 years and 15 years. Tables S43 and S44, and Figures S45  and S46 show that results from the standard meta-analyses using both the fixed and random effects models were similar to our primary pooled IPD estimates, albeit with wider confidence intervals. The similarity of pooled IPD estimates to those from the random effects standard meta-analysis suggests that the former are robust to relaxation of the assumption that the three cohorts are from the same population, and to standardisation of phenotypic variables separately within cohorts. Table S47 gives results from the separate cohorts, which were input into the standard meta-analysis.    Figure S45: Standard meta-analysis results for phenotypic covariance (CovP), genetic covariance (CovG) and CovG:CovP between maternal BMI and offspring phenotype for the three cohorts, from the fixed effects model (inverse variance weighted) Figure S46: Standard meta-analysis results for phenotypic covariance (CovP), genetic covariance (CovG) and CovG:CovP between maternal BMI and offspring phenotype for the three cohorts, from the random effects model (DerSimonian and Laird) Missing data caused the samples used for our primary GCTA-GREML analyses to be smaller than the full sample of live born singletons at baseline (Note S3, Figure S4). A substantial proportion of individuals (particularly in the NFBCs) had missing genotype data ( Figure S4), and some individuals with genotype available had to be excluded due to cryptic relatedness (Table  S50). Furthermore, although phenotype data were available for most individuals at birth, the proportion with missing phenotype data increased with age. It is possible that missing data could result in selection bias, for both the phenotypic and genetic covariance. To explore the potential for selection bias we examined the phenotypic associations between maternal BMI and offspring BW in the samples of live born singletons at baseline, and compared this to the same association in the samples used for the GCTA-GREML analyses (pooled IPD estimates).
Supplementary Table S49 shows that there were not substantial differences in the phenotypic associations between the baseline samples and the GCTA-GREML samples, suggesting that selection bias is unlikely to be large, at least for the phenotypic associations. We were unable to perform similar comparisons for the genetic covariance or for the phenotypic covariance at later ages, because due to the missing data it would be impossible to estimate these parameters in the baseline sample. Table S49: Results from linear regression of offspring BW (z-score) on maternal BMI (z-score), for the full sample of live born singletons at baseline with non-missing data for maternal BMI and offspring BW, and for the samples used for the primary GCTA-GREML analyses. Both variables were standardised, therefore the regression coefficient is equivalent to both Pearson's correlation coefficient and to the covariance