Functionally oriented analysis of cardiometabolic traits in a trans-ethnic sample

Abstract Interpretation of genetic association results is difficult because signals often lack biological context. To generate hypotheses of the functional genetic etiology of complex cardiometabolic traits, we estimated the genetically determined component of gene expression from common variants using PrediXcan (1) and determined genes with differential predicted expression by trait. PrediXcan imputes tissue-specific expression levels from genetic variation using variant-level effect on gene expression in transcriptome data. To explore the value of imputed genetically regulated gene expression (GReX) models across different ancestral populations, we evaluated imputed expression levels for predictive accuracy genome-wide in RNA sequence data in samples drawn from European-ancestry and African-ancestry populations and identified substantial predictive power using European-derived models in a non-European target population. We then tested the association of GReX on 15 cardiometabolic traits including blood lipid levels, body mass index, height, blood pressure, fasting glucose and insulin, RR interval, fibrinogen level, factor VII level and white blood cell and platelet counts in 15 755 individuals across three ancestry groups, resulting in 20 novel gene-phenotype associations reaching experiment-wide significance across ancestries. In addition, we identified 18 significant novel gene-phenotype associations in our ancestry-specific analyses. Top associations were assessed for additional support via query of S-PrediXcan (2) results derived from publicly available genome-wide association studies summary data. Collectively, these findings illustrate the utility of transcriptome-based imputation models for discovery of cardiometabolic effect genes in a diverse dataset.


Introduction
Cardiometabolic traits are highly heritable, with heritability estimates ranging from ∼0.3 to 0.8 (3)(4)(5). Through genomewide association studies (GWAS) and exome-chip based association studies, genetic researchers have identified more than a thousand common and low-frequency variants contributing to the risk of cardiometabolic traits to date, yet only a small fraction of the heritability of these traits [e.g. 27% for height on the high end (6)], has been explained. Furthermore, the molecular mechanisms through which identified cardiometabolic variants confer risk are largely unknown. Some relevant molecular pathways have been identified via gene set enrichment and gene ontology analyses, often using the nearest (in linear distance) genes (6)(7)(8)(9)(10)(11); however, signals often occur far from a protein coding region or functional element in the genome leading to weak assumptions about the relevant biology. More recently, interrogations of lowfrequency coding variation via exome chip analysis identified a handful of novel loci, but minimal additional insight into the potential functionality of previously known common loci was gained from these findings (6,10).
Regulation of gene expression is influenced by many factors including tissue or cell type, developmental or life stage, environmental exposures, and genetic factors (12)(13)(14)(15)(16) such as promoter and enhancer sequence variation (17). Expression levels of certain genes may be a predictor of future disease or a biomarker of current disease (18,19), and many noncoding variants are associated with RNA abundance [i.e. expression quantitative trait loci (eQTLs)] and multiple eQTLs may jointly genetically regulate expression of a particular gene. Recent studies suggest that top hits in many GWAS for complex traits are enriched for eQTLs, allowing for interpretation of GWAS signals in the context of their role in gene expression regulation (20)(21)(22)(23). By aggregating variant effects at the gene level, by design, results of genetically regulated gene expression (GReX)-based analyses are directly interpretable within a functional biological framework and also benefit from reduced multiple testing correction. As in traditional genome-wide association studies, GReX-based analyses are hypothesis generating rather than prescriptive, and are also susceptible to synthetic associations due to co-regulation of genes. However, in some instances distinct expression regulation at neighboring genes can be leveraged to target putatively functional genes in disease etiology, generating testable hypotheses of causality. Further, GReX-based genome-wide association tests may avoid some of the pitfalls of measured gene expression at a single time point, because genetically controlled traits often represent life-long influences, without the environmental, batch and temporal effects captured by measured gene expression (24).
Methodologically, two approaches to GReX estimation have been proposed: elastic net regression implemented in PrediXcan (1), and Bayesian Sparse Linear Mixed Models (BSLMM) implemented in TWAS, a transcriptome-wide association study approach (25). In PrediXcan, GReX panels are built with elastic net regression using cross-validation to prevent model overfitting (1); elastic net regression is an excellent variable selection method for 'wide' datasets with many predictors/features (26) and a sparse architecture, where a limited number of features have relatively large effect sizes. The TWAS-implemented BSLMM approach for GReX estimation assumes a mixture of both sparse and broad polygenic effects (25)-allowing for many small effect sizes distributed across many predictors. Comparisons of these two approaches indicate that on the tissue level, gene expression traits appear to be driven by a set of sparse, large effect variants (27). Additional analyses also illustrate that interactions among eQTL variants from lymphoblastoid cell lines likely do not influence gene expression traits to any strong degree (28), though this hypothesis remains to be tested in primary tissues. The expression imputation models we employ in PrediXcan were created by jointly estimating the effects of genomic variants on transcriptome data in large datasets such as the Genotype-Tissue Expression project (GTEx v6p) (29,30) and Depression Genes and Networks (DGN) (31) study. Using PrediXcanestimated GReX on the large collection of cardiometabolic GWAS studies will achieve two simultaneous goals: (1) the aggregation of multiple SNP effects into a single functional unit increases statistical power for discovery and (2) the detection of a significant result generates a hypothesis that a specific gene (and potentially higher biological mechanisms) may be important for cardiometabolic risk.

Model assessment
Existing work has confirmed the applicability of genotype-based gene expression prediction in diverse training or target populations (32); however, because we applied GReX models built from RNA-sequencing data generated from tissues collected from individuals of primarily European descent in datasets drawn from non-European populations, we performed a number of experiments to evaluate the quality of cross-ancestry expression prediction in available data. We directly assessed the accuracy of PrediXcan-estimated GReX in our study data by comparing imputed whole blood expression and measured WB RNA sequencing levels in a subset of 175 European-ancestry ARIC samples. Using whole blood DGN (DGN-WB) imputed GReX, the squared correlation coefficient of the directly measured RNA expression with PrediXcan-derived GReX deviated significantly from the expected R 2 distribution at the tail end of the observed distribution, indicating the predictive utility of these externallybuilt models in our samples (Supplementary Material, Fig. S1). An additional comparison to the gene's heritability, as assessed in the data on which the PrediXcan models were trained, is made in the same figure (1). Because our imputations are estimating GReX, not RNA abundance, the heritability of the expression for that gene sets an upper bound for how well we can impute its expression using genetic variation. Imputed expression in the ARIC samples captured up to 75% of the variance in the observed expression. Notably, we found a significant correlation between the cross-validation R 2 from the model building step (trained in the European-ancestry DGN-WB samples) and the explained variance (in measured expression captured by GReX) in the ARIC samples (Spearman correlation = 0.70, P < 2.2e-16). Top results are annotated with the ARIC R 2 /h 2 values (Table 1 and Supplementary Material, Table S1).
Using the same tissue (whole blood), we found significant correlations (P < 2.2e-16; Supplementary Material, Fig. S2) of z-score for DGN-and GTEx-trained models for each study phenotype, indicating that association results obtained from the application of models derived from the ancestrally heterogeneous GTEx-WB and the homogenous (European ancestry) DGN-WB showed a high degree of concordance. To verify the robustness of our trans-ethnic analysis, we conducted additional analyses to assess the utility of PrediXcan models developed in DGN and GTEx in non-European populations, using African ancestry YRI LCL samples. We find that imputation quality (cross-validation R 2 ) in European ancestry DGN-WB was significantly associated (P = 0.007) with imputation performance in the African ancestry YRI samples, though that correlation between imputed and measured expression is lower, on average, in the African population (Supplementary Material, Fig. S3). We also observe that among models with high R 2 (presumed to be well modeled based on the training data), we see highly significant correlation of imputed and measured expression, with a substantial number of genes with correlation P < 1e-3 (Supplementary Material, Fig. S4). In summary, although we detect that predictive power is slightly reduced, PrediXcan retains substantial predictive power when European-derived models are applied to non-European samples. Until more equal representation of non-European populations in publicly available gene expression datasets is attained and ancestryspecific expression prediction models generated for the subset of genes with ancestry-specific genetic regulation, application of GTEx derived models is likely to retain power to detect effects at most genes in most tissues in datasets drawn from non-European populations.  Table S1). We detected many genes associated with white blood cell (WBC) count in the well-known DARC locus, however, due to extensive linkage disequilibrium, all further results exclude genes in this region (spanning 1q21.1 to 1q23.3). Across the 15 traits, GReX of 167 unique gene-phenotype associations (270 gene-phenotype-tissue associations) were significant (experiment-wide BH-adjusted P < 0.05) in at least one biologically relevant tissue in ancestry-specific analyses or trans-ancestry meta-analysis (Supplementary Material, Fig. S5). Thirty-eight of these are novel, significant genetrait associations including 20 experiment-wide significant trans-ancestry results and 18 such ancestry-specific findings ( Table 1). Of our 270 total significant tissue-specific gene-trait associations, Gene2Pheno results were available for 126 known and 13 novel gene-trait-tissue combinations; 122 known (96%) and 2 (13%) novel genes were associated with the same trait in the same tissue(s) (P < 3.60e-4, based on Bonferroni correction for the number of Gene2Pheno genes queried).

Gene-trait associations
In height, we observed four novel gene-trait associations: LGR6 in DGN-WB and GYPE in visceral omentum adipose tissue in the trans-ancestry meta-analysis, SH3BGTL2 in visceral omentum adipose in European ancestry, and RCBTB1 in adrenal gland in African Americans. GYPE was also significant in Gene2Pheno (P = 2.6e-05). We observe novel associations between fasting insulin levels and PHOSPHO1 in skeletal muscle and DGN-WB (European ancestry), and FAM175A in DGN-WB. Five genes not previously associated in GWAS of serum lipid levels were identified by trans-ancestry meta-analysis: ZNF441 (LDL, pancreas); TAF6L (triglycerides, WB); APOL5 (triglycerides, visceral omentum adipose); C2orf70 (triglycerides, pancreas); and PCBP4 (triglycerides, liver). Additional European-specific associations were observed, including BTBD3 for LDL and cholesterol in WB and DGN-WB; and RBKS for triglycerides in skeletal muscle. TAF6L reached significance in Gene2Pheno; results for C2orf70 and PCBP4 were not available in Gene2Pheno. Despite being a compelling candidate, APOL5 was not significant in Gene2Pheno results for association with triglycerides in our discovery tissue (visceral omentum adipose, P = 0.094).
Novel findings among blood and vascular traits were discovered in trans-ancestry analyses of factor VII, DBP, fibrinogen, platelet count and WBC count; African-American-specific analyses for SBP, platelet count and WBC count (Table 1). Results for WBC count were strongly enriched in the DARC locus, with 57 unique gene-phenotype associations meeting our significance threshold, primarily driven by African Americans (Supplementary Material, Table S1) (33). Three regions had significant GReX associations with factor VII level in the transancestry meta-analysis and were not observed in prior GWAS, all associated in multiple tissues. TSKU was significantly associated in aortic artery, tibial artery and left ventricle tissue, genes in 12.q12 (including GXYLT1) in aortic artery, tibial artery and DGN-WB, and EXOC4 in DGN-WB and WB. TSKU was previously implicated in a sub-analysis of a recent GWAS, when analyses were restricted to only factor VII activity (excluding factor VII antigen) (34).

Pathway analysis
In total, 50 annotation terms were significantly over-represented, most of which were identified for factor VII and WBC count (Supplementary Material, Table S3). Total cholesterol, HDL cholesterol, LDL cholesterol and triglycerides also each had at least one term over-represented in the corresponding candidate gene sets. The most significant terms for these traits are early endosome (cholesterol), alcohol dehydrogenase activity (FVII), very-low-density lipoprotein particle remodeling (HDL), lipid metabolism (LDL), lipoprotein metabolic process (triglycerides) and immunoglobulin domain (WBC).
Among the novel genes identified for any trait, there were four that showed evidence of protein-protein interaction with known genes for that trait in STRING (Supplementary Material, Table S4). STRING considers evidence from seven sources in establishing a functional association score. For platelet count, N.SNPs is the number of SNPs used to predict GREx; P-value (MA) is the meta-analysis across all three ancestires; P-value (EA) is the European specific GREx by trait association p-value; P-value (AA) is the African American specific GREx by trait association p-value; P-value (HL) is the meta-analysis p-value across the three Hispanic/Latino studies; Qval is adjusted for false discovery rate, from PrediXcan model; Pearson's r2/h2 is squared correlation coefficient between predicted and observed whole blood (DGN) expression / PrediXcan local heritability estimate ; NR indicates no results were available for the gene-tissue combination in Gene2Pheno; TNA indicates that the traits was not available in a Gene2Pheno database.
TP53 is predicted to interact with known-gene BAK1 with a combined score of 0.993, with strong evidence of interaction from the expertly curated database score and being frequently co-mentioned in PubMed abstracts. For body mass index (BMI), NUP107 shows evidence of interaction with known-gene NUP160 with a score of 0.999, based on experimental biochemical evidence and correlation of gene expression, as well as the databased and co-mentioned scores. For factor VII, GXYLT1 has a combined interaction score of 0.762 with F7, with most evidence due to being co-mentioned. Finally, for height, LGR6 shows modest evidence of interaction with GNA12, with a score of 0.557 based on experimental evidence and being co-mentioned in PubMed.

Discussion
Approaches that leverage genetically regulated expression prediction, although still fairly new, have been the subject of appreciable criticism. In brief, the primary concerns are that these approaches do not imply causality, are vulnerable to identifying synthetic associations due to co-regulation of genes, can be sensitive to the tissue or expression panel that is used, and ultimately only amount to a weighted burden test. These concerns are valid, but also apply to the most commonly used approach in the past era of complex disease genetics: the genome-wide association study. Just as SNPs identified by GWAS may not in fact be causal, GReX-based approaches are intended to be hypothesis generating, identifying gene candidates whose expression may impact a trait under study; similarly, SNPs in GWAS may be identified due to synthetic associations. GReX-based association analyses are indeed analogous to a weighted burden test-one that uses functional impact on expression as a weight-and so it is logical that they would be sensitive to the same confounding factors as the single variant association statistics that are being essentially combined. As SNPs identified in GWAS must be functionally validated to establish causality, so must genes implicated by PrediXcan. Despite these methodological challenges, for cardiometabolic traits we have identified 38 novel genes in our trans-ancestry and ancestry specific analyses, and replicated known signals at 289 genes. These results represent novel hypotheses of the causal genes underlying new and known GWAS signals. In addition, we explored the utility of GReX prediction models derived from Caucasian datasets in non-Caucasian ancestral populations, and determined that substantial power for GReX estimation is maintained at many genes; further scientific investment in increasing tissue and ancestral diversity of transcriptome datasets will be necessary to capture ancestryspecific eQTL effects.
Among our most biologically interesting novel associations were four genes on chromosome associated with FVII levels. Imputed expression levels of PPHLN1, ZCRB1, YAF2 and GXYLT1 were strongly correlated with each other in whole blood (minimum r 2 = 0.46 and maximum r 2 = 0.83, respectively). GXYLT1 adds the first xylose to O-glucose-modified residues in the epidermal growth factor repeats of proteins. One of these proteins may be factor VII: factor VII does have epidermal growth factor repeats, and is known to be O-linked glycosylated (35). Prior functional work has demonstrated that inducing a mutation affecting the amino acid that is glycosylated reduces the activity of FVII (36).
Other novel associations with compelling functional evidence include VAMP1 predicted expression with fibrinogen levels and PHOSPHO1 predicted expression with fasting insulin.
Members of the VAMP family of proteins are involved in vesicle secretion, including alpha granules, which contain fibrinogen (37,38). The precise function of VAMP1 has not been previously described, but our finding, that GReX is positively associated fibrinogen levels, suggests that like other members of this protein family it may be involved in platelet secretion. Increased PHOSPHO1-imputed expression was associated with higher insulin levels; previous literature reports that decreased methylation status in the body of these gene increases risk of developing type 2 diabetes (39). Additional work shows that methylation is decreased in skeletal muscle samples from T2D cases in comparison to controls (40).
We also provide new support for some genes within known regions, but with weak or population-specific prior evidence. For example, the association between DGN-WB GReX of plateletactivating factor PAFAH1B2 with triglyceride level had strong additional support in Gene2Pheno (replication P = 1.1 × 10 -17 , Table 1). Previous GWAS evidence comes from a Japanese study, in which the identified variant has a MAF of 11% (41). Further, a single rare coding variant in this gene was recently found to have a large effect on triglyceride and HDL levels in European Americans (42). Our findings complement this discovery, expanding the evidence that this gene impacts triglyceride level via common eQTL effects as well as rare functional variation. Because triglyceride level is a causal factor for cardiovascular disease, we explored Gene2Pheno results for PAFAH1B2 in the additive analysis of coronary artery disease in CARDIoGRAM C4D. In DGN-WB, we observe P = 0.014 in Gene2Pheno, with the same magnitude and direction of effect on triglycerides as we observed from PAFAH1B2 in the same tissue, for the first time suggesting that this gene may impact cardiovascular disease risk via effects on triglyceride levels.
For significant known genes, an additional analysis conditional on the known, sentinel variant in the gene or locus was undertaken, when available, to determine independence of the identified GReX signal with the previously identified signal. For 41 tissue-and trait-specific gene associations, including 25 unique genes, the genetically determined expression shows evidence of independent association with the trait of interest, beyond the known single variant signal (conditional P < 0.05). Additionally, three genes, F10 for Factor VII, and SORT1 and SYPL2 for LDL retained P < 2.5e-6 after accounting for the sentinel SNP, demonstrating an independent signal. These results highlight the utility of eQTL evidence to clarify the biology underlying genome-wide association signals, both through identification of novel, independent signals in the same locus and through the refinement of mapping effects. Rather than attributing results simply to the nearest neighbor gene, instead we identify genes where predicted expression is significantly influenced by eQTLs in LD with the association signal.
There are several reasons a particular known locus would not be detected in our analyses: (1) our total sample size is substantially smaller than the large-scale meta-analyses available; therefore, we have less power, despite our reduction in tests.
(2) The biological mechanism underlying previous associations may not be well represented through PrediXcan. For example, changes in amino acid sequence may result in changes in protein function but not expression. (3) Changes in expression may occur in one tissue or cell type that is not captured in the GTEx data. (4) Ancestry-specific associations may act through gene expression changes that are not captured in the predominantly European expression datasets. Despite these limitations, we replicated hundreds of known loci for the 15 cardiometabolic traits under study. Most (289 of 339) of our associations fall within 250 kilobases (kb; or larger in regions of extended LD) of variants previously associated with the same trait; however, only a small fraction of all known loci is detected (e.g. 22 of 157 identified for lipids traits) (11). On the whole, we observe an excess of association signals at genes within 250 kb of previously reported SNPs ( Supplementary  Material, Fig. S6). Just as traditional SNP-based GWAS narrows, often considerably, a disease association to (and thus proposes) a disease-relevant locus (which must subsequently undergo finemapping to identify the causal variants(s)), our approach proposes a disease-relevant gene expression mechanism (whose tissue or indeed cell-type specificity must be further resolved). Therefore, genes identified at known loci in these analyses represent a proposed refinement of the causal biological factor underlying the previously described single variant association signal.
We note that many of our results are observed across several tissues, consistent with some level of shared regulation across tissues (43). To visualize the concordance of significance across tissues for each novel gene, we plotted Z-scores for GReX-trait association across all relevant tissues for each trait ( Fig. 1 and Supplementary Material, Fig. S7). This suggests that genetically regulated expression associated with disease risk may implicate multiple tissues, some or all of which may be relevant to disease pathogenesis. Future in vitro experimentation will be required to understand which tissues' altered expression is responsible for the increased risk.
The Database for Annotation, Visualization and Integrated Discovery (DAVID) analyses provide additional support for novel genes identified by our study, some in regions previously implicated in GWAS, where historically the signals have been attributed to other genes (often through physical proximity rather than through any functional support) (44). Two novel genes identified herein comprise 13 overrepresented pathway results, contributing to the observed overrepresentation of annotation terms (Supplementary Material, Table S3; Benjamini FDR < 0.20). For WBC count, Signal Regulatory Protein Gamma (SIRPG, a transmembrane glycoprotein) was annotated with nine overrepresented annotation terms, including several related to immunoglobulin, disulfide bond, signal peptide and extracellular domain. FCRL1, a known gene for WBC count, was highly significant (p BH = 4.83E-10) in our African ancestry sample and annotated with similar annotation terms (Supplementary Material, Table S3), illustrating novel trait-associated genes (each in a distinct locus) with highly convergent function. One novel gene not previously implicated in GWAS of lipids traits (despite the much larger sample size (11)), APOL5 (from the meta-analysis of triglycerides), is a biologically compelling member of the highdensity apolipoprotein L family which is known to play a major role in cholesterol transport, though is known to have a distinct origin from the APOL1-APOL4 cluster and has not been previously observed in genetic studies (45,46). APOL5 contributed to the overrepresentation of genes involved in lipid and lipoprotein metabolic processes (Supplementary Material, Table S3), which are implicated in known genes for lipids traits. These examples suggest that our findings, collectively, may challenge existing understanding of the relevant genes in known GWAS loci.

Conclusions
Functionally orientating analyses of common variants around gene expression improves biological interpretation and power through reduced multiple testing correction. This approach focuses on cis-acting variants that alter gene expression without the burden of obtaining tissue samples and directly measuring RNA abundance genome-wide. Our imputed-GReX analysis of 15 cardiometabolic traits in 15 755 individuals across three ancestry groups has revealed 38 novel gene-trait associations in new regions, 18 of which were identified in ancestry-specific analyses. Our findings provide potential function-based refinement or support of an additional 289 genes previously implicated in GWAS. Most significant associations reached the Bonferronicorrected significance threshold in Gene2Pheno results, where available, indicating strong replication rate. We found that directly measured WB RNA abundance strongly correlates with imputed expression level at genes with low FDR and that models developed in mostly European ancestry retain predictive power in African samples, providing justification for our trans-ancestry approach. These findings support the validity of the known and novel associations we describe, while highlighting the need for further investment in RNA sequencing data from diverse samples. Furthermore, many of these findings are supported by gene ontology pathway and protein-protein interaction analyses, including novel findings and genes that have not previously been mapped to known GWAS signals. Natural extensions of these methods to include trans-acting eQTLs and other functionally annotated regions of the genome such as those described by ENCODE may provide further power to detect new genes impacting complex disease traits and refine GWAS signals to identify candidates for functional validation. Starr County: Phenotypic measures for 1006 Mexican Americans from Starr County, Texas included in this work were generated in a survey of a representative population of the county. Details of the collection are previously described (48).

Subjects by parent study
Mexico City: Two datasets were available from Mexico City. Both datasets were collected for studies aimed at identifying type 2 diabetes risk factors using genome-wide approaches. The first study includes 967 individuals with type 2 diabetes and 343 controls. The second study comprises 898 individuals with type 2 diabetes and 889 controls. All individuals were included in the present study. Detailed information about the Mexico City datasets has been previously published (21,49).
Additional demographic details for each contributing study are given in the Supplementary Material, Table S2.

Expression imputation
Array-based genotype data was generated for each contributing study following standard laboratory protocols and imputed to 1000 Genomes Project phase 1 or 3 reference data separately using IMPUTE2. The study-specific GWAS array data details, preimputation quality control filters and imputation details are given in the Supplementary Material, Table S5.
SNP imputation data for each study group was filtered prior to imputing expression to retain only SNPs with imputation info scores >0.8 and MAF > 5%. Tissue-specific imputed expression levels were obtained for each individual by applying PrediXcan imputation models (8/18/16 version) built using elastic net (α = 0.5) on the 1000 Genomes phase 1 SNP set in GTEx v6p and DGN consortium whole blood RNA sequencing data. Imputed GReX levels for PrediXcan models with q-value < 0.05 were carried forward for further analysis. For all reported gene-tissue pairs, we verified that they were not flagged as prone to false positives based on updates to the modeling process used in computing GTEx v7 models (list provided by Im et al. and publicly available on predictdb.org); one gene-tissue pair was removed.

Phenotype definitions
Subjects taking lipid-lowering medications were excluded from lipid phenotype association analyses. For subjects from Starr County and ARIC, LDL was calculated from the Friedewald equation (50), with missing values assigned for samples with triglyceride levels >400 mg/dL. For both Mexico City studies, LDL was directly measured (21). Subjects taking medication for hypertension were excluded from blood pressure analyses. Subjects diagnosed with T2D were excluded from fasting glucose and fasting insulin levels; in Starr County individuals with incident T2D based on an oral glucose tolerance the same day were included. All other analyses included all samples with phenotype data available. Raw phenotype values for each study were regressed against age, sex and ancestry using the first and second principal components as covariates and residuals were ranked and inverse-normal transformed. Summary statistics for each phenotype are presented in the Supplementary Material, Table S2.

Tissue/trait selections
Phenotypes were grouped into two sets of related traitsmetabolic, including BMI, fasting glucose, fasting insulin, height and lipids traits (HDL, LDL, total cholesterol and triglycerides), and blood and vascular traits, including blood counts (platelets and WBC), blood pressure (DBP and SBP), clotting factor levels (factor VII and fibrinogen) and RR interval. These traits have shared underlying biology and pleiotropic loci have been reported (51,52). To limit multiple testing burden, tissues that have previously shown relevance to multiple traits in each class were selected. Predicted expressions derived from whole blood (GTEx and DGN) and the adrenal gland were used for all traits. For metabolic traits, subcutaneous and visceral omentum adipose tissues, liver, pancreas and skeletal muscle were considered (8,9,21,53). For blood and vascular traits, aortic, coronary, and tibial arteries, and atrial appendage and left ventricle of the heart were used (54)(55)(56)(57). For height, pituitary gland expression was considered; liver expression was considered for fibrinogen and factor VII; and hypothalamus expression was considered for BMI (8,9,56).

Association tests
Association testing of transformed phenotypes with imputed expression levels (GReX) in each study group was performed using generalized linear models in R. Ancestry-specific association results were classified by whether the gene or nearby SNP had previously been implicated in a published GWAS of the trait for each significant and suggestive gene-trait pair. Previouslyobserved status was initially determined by a gene start site within 250 kb of any SNP associated genome-wide with the same trait in the NHGRI-EBI catalog (58). Any potentially novel association was followed-up with searches in PubMed and Google Scholar for any prior evidence of association. To control for false discovery rate given the numerous hypothesis tests used, we employed a Benjamini-Hochberg multiple testing correction for all traits and all tissues, resulting in an experiment-wide significance threshold (59). We note that while less conservative than a Bonferroni approach to multiple test correction, this approach is likely overly conservative due to the invalid assumption of independence of tests; for example, many genes tested do not exhibit tissue-specific expression patterns, and within tissues, many genes are co-expressed.

Trans-ancestry meta-analysis
Within each trait-tissue pairing, we meta-analyzed GReX association results for the three Hispanic studies using an inversevariance approach, as implemented in METAL (60). We also used this approach to meta-analyze all five studies in a transancestry meta-analysis. Significant and suggestive results were first selected from the trans-ancestry meta-analysis findings; any gene-trait-tissue groupings that did not reach these thresholds within the trans-ancestry results were evaluated for single ancestry association significance. QQ plots for the transancestry meta-analysis results for each trait are shown in the Supplementary Material, Fig. S8, including lambda values calculated by taking the median of the chi-squared test statistics over the expected mean for each tissue (range: 0.90-1.12). Metaanalysis results were filtered to determine novelty following the same process used for ancestry-specific results.

Additional support from Gene2Pheno
Where available, we sought additional support via replication of findings in the S-PrediXcan Gene2Pheno databases (2,61,62). S-PrediXcan utilizes SNP-level association summary statistics, in an analytical approach analogous to PrediXcan, to identify gene-level trait associations. Results derived from large, previously published GWAS (2) for several included traits (BMI, height, HDL, LDL, triglycerides, DBP and SBP) are publicly available through Gene2Pheno. Because the ARIC study contributed data to several of the large-scale association studies (accounting for at most 8.1% of the total sample size of these studies) included in Gene2Pheno, this does not represent true independent replication for all traits (noted in Table 1 and Supplementary Material, Table S1). Gene2Pheno results were not available for all tissue-and trait-specific gene associations. Many of the large GWAS in Gene2Pheno were imputed to reference datasets with smaller SNP sets, i.e. HapMap instead of 1000 Genomes; in some cases, the prediction model for a gene/tissue combination was not available in HapMap models (denoted as NR in Table 1 and Supplementary Material, Table S1). For some tissue-and trait-specific gene associations, differences in models used for Gene2Pheno resulted in model quality differences and models with high S-PrediXcan q-values (q > 0.05) were not included in Gene2Pheno. Thus, overall, we were able to attempt replication for 139 gene-trait associations.

Gene ontology and protein-protein interaction analysis
We used DAVID Functional Annotation Tool v6.8 (63) to identify annotations, including Gene Ontology (GO) terms overrepresented (Benjamini FDR < 0.20) in genes with association Benjamini-Hochberg adjusted P < 0.05 in the present study, stratified by trait. We used STRING (64) to search for proteinprotein interactions between known and novel genes for each trait (Supplementary Material, Table S4). Scores can range between 0 and 1, with a higher score indicating more confidence that there is an interaction.

Prediction validation by RNA sequencing
RNA sequencing data for 175 European ancestry individuals from the ARIC study were available as a test study to explore correlation of imputed expression with measured expression in whole blood (using both GTEx and DGN as reference panels). See the Supplementary Note for RNA quantification methods. GReX levels for 12 081 genes were imputed in whole blood using the PrediXcan framework. PrediXcan models with q-values > 0.05 (i.e. genes that were not reliably imputed) were removed, leaving a final set of 7915 genes for DGN-WB and 5632 genes for WB. The observed distribution of Pearson squared correlation coefficients was compared with that expected by chance ( Supplementary  Material, Fig. S1).
We imputed gene expression in the 89 African ancestry 1000 Genomes YRI (Yoruba in Ibadan, Nigeria) samples with available expression measurements (RNA-Seq) in LCLs using GTEx LCLs as the reference panel. We compared this to measured expression using previously described RNA sequencing data (65). For genes with significant Spearman correlation (P < 0.05) of imputed and measured gene expression, we compared the variance in expression explained with the P-value of the correlation. Finally, to assess the effect of differing ancestry in the training population, we compared association results for models trained in an ancestrally homogenous sample (DGN-WB) and those for models trained in GTEx, a sample for which 15% of donors are African Americans (43).
8C, HHSN268201100009C, HHSN268201100010C, HHSN268201100 011C, and HHSN268201100012C). The authors thank the staff and participants of the ARIC study for their important contributions. RNA sequencing was carried out at the Baylor College of Medicine Human Genome Sequencing Center and supported by the National Human Genome Research Institute grants U54 HG003273 and UM1 HG008898.