Inter-individual differences in gene expression are likely to account for an important fraction of phenotypic differences, including susceptibility to common disorders. Recent studies have shown extensive variation in gene expression levels in humans and other organisms, and that a fraction of this variation is under genetic control. We investigated the patterns of gene expression variation in a 25 Mb region of human chromosome 21, which has been associated with many Down syndrome (DS) phenotypes. Taqman real-time PCR was used to measure expression variation of 41 genes in lymphoblastoid cells of 40 unrelated individuals. For 25 genes found to be differentially expressed, additional analysis was performed in 10 CEPH families to determine heritabilities and map loci harboring regulatory variation. Seventy-six percent of the differentially expressed genes had significant heritabilities, and genomewide linkage analysis led to the identification of significant eQTLs for nine genes. Most eQTLs were in trans, with the best result (P=7.46×10−8) obtained for TMEM1 on chromosome 12q24.33. A cis-eQTL identified for CCT8 was validated by performing an association study in 60 individuals from the HapMap project. SNP rs965951 located within CCT8 was found to be significantly associated with its expression levels (P=2.5×10−5) confirming cis-regulatory variation. The results of our study provide a representative view of expression variation of chromosome 21 genes, identify loci involved in their regulation and suggest that genes, for which expression differences are significantly larger than 1.5-fold in control samples, are unlikely to be involved in DS-phenotypes present in all affected individuals.
Understanding the relationship between sequence variation present in human populations (1) and phenotypic diversity is one of the main challenges of functional genomics. Only a fraction of polymorphic variation is likely to be of functional significance, and traditionally, studies have focused on polymorphisms that alter the primary sequence of proteins as prime candidates for functional variation (qualitative changes). Early studies in model organisms as well as in humans have shown that functional protein sequence variants or hypomorphs are indeed a valid paradigm to explain trait differences within populations (2–4). Interestingly, however, numerous recent association studies implicating genetic variation of a gene to a particular phenotype have failed to identify coding SNPs as candidates for the etiological effect, suggesting that other mechanisms are likely to underlie the molecular pathology in these disorders (5–7).
An alternative way in which sequence variation can have a functional impact is by affecting the steady-state level of mRNA molecules of a particular gene in a given cell (quantitative changes). Since the 1970s, it has been suggested that quantitative differences in gene expression might provide a significant source of variation in natural populations, representing an important substrate for evolution and accounting for a considerable fraction of phenotypic diversity (8).
Changes in the dosage of a gene or group of genes have previously been shown to be associated with human disorders such as trisomy 21 and other contiguous gene syndromes (9–12). It is thus likely that differences in gene expression can also explain the population variance of many traits.
High-throughput expression profiling in a number of organisms has revealed that variation in gene expression levels within and among populations is abundant, with a large proportion of genes (20–40% in most studies) showing significant patterns of inter-individual variation (13–16). Although this variation could be of stochastic, environmental or genetic origin, analyses of expression differences in segregating populations (from lower eukaryotes to mammals) have demonstrated that a significant proportion of the variance has a genetic component (16–19). As such, gene expression levels can be considered as a quantitative trait (eQTL) and thus be dissected using standard genetic approaches such as quantitative linkage analysis.
Understanding the pattern of gene expression variation in humans is important, as it is likely to underlie a significant proportion of the risk for many common disorders. In the context of human chromosome 21 (Hsa21), this question is particularly relevant because it could help to understand the relationship between trisomy of Hsa21 and Down syndrome (DS) phenotypes and, in particular, give new insights into the molecular basis of the extensive phenotypic heterogeneity observed in trisomy 21 patients (20).
In our study, we set out to investigate the patterns of gene expression variation in a 25 Mb region of Hsa21, using a high-precision approach combining TaqMan real-time PCR and multiple replicates per sample.
We screened for inter-individual differences in expression in 41 genes using a panel of lymphoblastoid cell lines derived from 40 unrelated individuals from the CEPH collection (21). Genes found to be differentially expressed were further analyzed in 10 three-generation families in order to determine: (i) What fraction of the variation has a genetic component? (ii) Which loci in the genome are associated with these expression traits?
Gene expression variation
Out of 71 Taqman assays initially designed, a total of 41 met our efficiency criteria of 0.95<E<1.05 (10) in RNAs from lymphoblastoid cell lines. These 41 genes were screened for differential gene expression in cDNAs obtained from 40 unrelated individuals, namely the grandparents of 10 CEPH families (Fig. 1). For each individual, we calculated the normalized relative expression for each of the 41 genes as described in Methods section. The median 95% CIs for relative expression measurements was ±0.18, indicating that inter-individual differences as small as ∼1.4-fold could be readily detected for the large majority of assays.
To exclude Epstein–Barr virus (EBV) transformation effects and cell culture conditions as major sources of gene expression variation, we transformed lymphocytes from the same individual six independent times over a 2-week period. Expression analysis of 25 HSA21 genes and seven normalization genes in these cell lines showed high-expression correlations for all genes, and inter- and intra-cell line variation were not significantly different (R. Lyle and A. Reymond, unpublished data).
To determine which genes were differentially expressed between individuals, we used the variance ratio (15), which gives the ratio of the inter-individual variance over the average intra-individual error. Twenty-five genes were selected for further study based on having a variance ratio >1 (Table 1).
The median fold change in expression (ratio of individual with highest expression over individual with lowest expression) for the set of differentially expressed genes was 3.2, but values ranged from 2 to 47.43 (Table 1). However, for some genes (e.g. APP and CBS) in which a number of individuals display very low expression levels, the fold change might be overestimated due to limitations in the sensitivity of detection.
To determine what fraction of the variance of the differentially expressed genes could be attributed to genetic components, we measured expression levels in 10 three-generation families from the CEPH collection (21). In total, we assayed cDNAs from 135 individuals, including the initial 40 unrelated individuals who are the grandparents of these pedigrees.
Seventeen out of the 25 differentially expressed genes had significant heritability values (P<0.05), with a maximum of 0.84 observed for SLC37A1 (P=8.07×10−11) and a median of 0.42 for all genes (Fig. 2).
We also calculated the significance of the heritability estimates for each trait by multiple permutation analysis. For this purpose, we permuted the phenotype (normalized relative expression level) for each gene 500 times and calculated the heritability using the same family structures. Results from this calculation showed that all 17 genes previously found to have significant heritabilities remained significant, and in addition, two of the genes that had been considered marginally not significant (INFAR2 and PRDM15 both at P=0.06) were significant at the 95% level by multiple permutation and were thus considered for quantitative linkage analysis.
To identify candidate loci involved in the transcriptional control of chromosome 21 genes, we performed genomewide quantitative linkage analysis on the 10 study families. We used SNP genotyping information available in public databases. All genotyping errors as assessed by Mendelian incompatibility patterns were removed (22).
For our pedigrees, we estimated 60% power to detect a suggestive (P<0.001) eQTL given the median heritability of 0.42. This power increases to >90% for heritability values over 0.6.
Multipoint linkage analysis resulted in the identification of at least one eQTL for four genes (first four genes) (Table 2) using a threshold of P<1.6×10−4 (which corresponds to the theoretical genomewide significance of 0.05) (23). The most significant result was obtained for TMEM1 with P=7.46×10−8 (LOD score of 6). In one case, the eQTL for a gene (CCT8) mapped to within 5 Mb of its physical location (cis-eQTL), whereas the other eQTLs were located elsewhere in the genome (trans). In addition, for TMEM1, multiple trans-eQTLs were identified (Fig. 3).
If the criteria are relaxed to a threshold of P<0.001 (suggestive linkage), eQTLs for eight additional genes are identified (last eight genes) (Table 2).
Variance component analysis is one of the most commonly used model-free methods to perform quantitative linkage analysis (24,25); however, it assumes that the traits analyzed are normally distributed. Violations of normality by the presence of outliers or excessive kurtosis can result in an increased level of type 1 errors. (24,26,27) Many quantitative traits are not normally distributed, for example, 84% of our expression traits significantly deviate from normality. Ignoring such deviations can lead to important misinterpretation of the results (24).
One way to deal with the problem of non-normally distributed traits is to determine the empirical significance of the linkage results (obtained through variance component or regression methods) by performing simulation studies. To this end, we computed 1000 simulations by randomizing the genotypes but keeping the expression phenotypes constant. Genotype randomization has the advantage of keeping heritability patterns and other aspects of the data structure, such as marker density, allele frequencies and missing data unchanged.
We analyzed the results of the simulations by extracting the highest linkage score per genome-scan per trait, in order to build a significance distribution (the simulated maxium LOD scores for each trait follow chi-square distributions). This analysis showed that in nine out of 12 cases, the genes with significant or suggestive eQTLs remained significant at the 95% level (genes in bold, Table 2). Hence, a high proportion of the eQTLs with suggestive P-values can be considered as significant on a genomewide basis according to the simulations.
For some traits, the rates of type 1 errors were considerably higher than expected (Fig. 3). For example, in the case of TMEM1 where several trans-eQTLs were initially identified, only one locus remained significant after adjustment.
As performing large-scale simulations is a computationally intensive process, we assessed an alternative method to address the problem of non-normality. We used Box–Cox and bivariate-normal copula transformations in order to approximate the traits distributions to normal. These transformations have a minimal effect on the internal correlations of the data. One thousand simulations were performed for each transformed trait to determine the effect this would have on the power and the rates of type 1 errors. Results of these simulations showed that while the power was not greatly affected by either type of data transformation (data not shown), the effects on the levels of type 1 error were variable and produced, at best, only small improvements (Supplementary Material, Fig. S1). These results suggest that data transformations are not efficient at solving the increased rates of type 1 errors observed for some traits, and thus, simulations are highly advisable to properly assess the significance of the results for each trait studied.
To further validate the cis-eQTL identified for the CCT8 gene, we performed an association study using 60 additional unrelated individuals from the CEPH collection for which high-density genotyping data are available as part of the international HapMap project (28).
Gene expression variation was measured as before, and genotypes for 41 SNPs from a 100 kb region surrounding the CCT8 gene were obtained. We performed association analysis for each SNP (predictor) to the normalized relative CCT8 expression values (response) using analysis of variance (ANOVA), and corrected for multiple tests using a step-down approach (29). To guard against the effect of non-normal trait distribution, we transformed the raw phenotypes to their logarithmic values to yield a normally distributed trait (Shapiro test for normality, P=0.56). Four SNPs were significantly associated with CCT8 expression levels (P=2.5×10−5) even after correction for multiple testing (P=9×10−4). Out of these SNPs, rs965951 is located in intron 14 of the gene, and the three other SNPs (rs2832159, rs8133819, rs2832160) are located 12 kb upstream. All four SNPs are in complete linkage disequilibrium (r2=1.0) (Fig. 4).
We set out to dissect the patterns of gene expression variation on a 25 Mb segment of Hsa21, which has traditionally been associated with many of the defining features of DS (mental retardation, characteristic facies, muscle hypotonia) (20,30), as well as with some of its variable phenotypes including congenital heart defects (31). In addition, this region of Hsa21 has also been linked to some common complex disorders such as bipolar affective disorder of unknown genetic etiology (32,33).
Our approach consisted in using real-time Taqman PCR and multiple replicates per sample, as this has been shown to allow the detection of small inter-individual differences in gene expression (10) that might nonetheless be physiologically important (34,35). Our data show that differences as small as 1.4-fold could be reliably measured, which compares favorably to microarray procedures in which few replicates per sample are performed, and for which typically only changes >2-fold can be confidently detected (36).
Our results revealed a substantial amount of inter-individual gene expression variation, as 25 out of the 41 genes analyzed (61%) showed differential expression (based on our conservative criteria). This finding supports the notion that gene expression variation is a highly important but poorly characterized source of molecular diversity that is likely to explain a considerable fraction of the phenotypic variance in the population (37). The median fold change in expression level between the lowest and highest values per gene among individuals was 3.2, with the largest inter-individual expression differences observed for APP (Fig. 1). This is interesting, as APP is crucially involved in the pathology of Alzheimer's disease, and different levels of APP could influence the pathogenesis of this disorder.
DS is a disease caused by alterations in gene dosage such that affected individuals are expected to have an average 1.5-fold overexpression of Hsa21 genes (20). Given the extensive variation in gene expression observed in our samples among normal individuals (at least 2-fold), we would predict that for many Hsa21 genes, there is a considerable overlap in total expression levels between normal and trisomy 21 individuals due to allelic variation. Thus, expression variation at the levels we report here may help to explain two related aspects of DS phenotypes: penetrance and variability. Overexpressed genes which have low levels of expression variation would be predicted to lead to the more penetrant phenotypes (e.g. CCT8 and U2AF1). In contrast, genes with high variation in expression (e.g. ITGB2 and CBS) would contribute to incompletely penetrant/variable DS-related phenotypes (model shown in Supplementary Material, Fig. S2).
However, because the gene expression variation data reported here are based on lymphoblastoid cell lines, confirmation of these results in additional cell types/tissues would be of great interest.
Analysis of the segregation patterns of these ‘expression traits’ in 10 CEPH pedigrees revealed that for 76% (19/25) of genes, a significant fraction of the variation is explained by genetic factors, with a median heritability of 0.42. Genomewide quantitative linkage analysis using previously generated genotyping data (38) led to the identification of significant eQTLs for nine out of the 19 genes, after correction for biases in the data structure and for multiple testing. The large majority of the eQTLs identified were in trans, suggesting that quantitative or qualitative differences in certain genes (transcription factors, proteins involved in signal transduction and others) (18) have an impact on larger transcriptional networks. However, given the results of other studies showing abundant allele-specific expression differences (39,40), we expect cis-regulatory variation to be common and most likely under-represented in our results.
For the CCT8 gene, which was the only one with a cis-eQTL, we performed an association study to validate the linkage findings. Results of the association analysis revealed that a group of four SNPs within or close to CCT8 were significantly associated with its expression levels, independently confirming the presence of cis-regulatory variation.
An interesting finding is that for some genes with high expression heritabilities (e.g. PWP2H and PFKL), no significant eQTL was identified even though power calculations on our sample size suggest ∼90% power to detect such loci. We thus hypothesize that in many cases, expression traits are regulated by multiple loci, each of which contributes only modestly to the trait. This higher complexity in the genetic architecture (41,42) underlying expression variation indicates that only in cases in which few loci account for a large proportion of the expression variability, eQTLs will be identified effectively. This may explain the under-representation of cis-eQTLs, and suggests that larger sample sizes are required to dissect more complex genetic regulation.
An important observation from our study is the increased type 1 error for many of the expression traits when performing variance components calculations for eQTL mapping. This highlights the need to take appropriate measures, such as simulations, to enable correct interpretation of the significance of the results.
In two previous studies, QTL mapping of gene expression traits was performed in humans using microarrays (17,43). These studies have a higher throughput, with thousands of genes being analyzed in parallel, which gives a more global view of the genetic control of gene expression. However, for specific regions involved in the etiology of human disorders, such as the one presented here, a more focused study design with higher number of replicates is possible, leading to a more accurate estimation of individual expression levels.
In this study, we provide a detailed view of gene expression variation of Hsa21 genes and a screen for eQTLs involved in their regulation. The extensive expression variation observed has important implications concerning the molecular pathogenesis and phenotypic variability in DS patients. Large-scale association studies with appropriate samples sizes will constitute the next step for the identification of regulatory variation.
MATERIALS AND METHODS
Families and lymphoblastoid cell culture
We obtained EBV-transformed lymphoblastoid cell lines of 135 individuals belonging to 10 CEPH (21) families (1333, 1334, 1340, 1341, 1345, 1346, 1347, 1362, 1408, 13292) from Coriell cell repositories. All cell lines were grown in RPMI 1640 with Glutamax I medium (Invitrogen Corporation) supplemented with 10% fetal calf serum and 1% penicillin and streptomycin mix (Invitrogen Corporation). Cells lines were harvested at a density of 0.6–1×106 cells/ml and at least 80% viability. Cultures were spun for 5 min at 1000 g, and the resulting pellets were washed once in PBS and lysed by adding 2 ml of micro-glass beads (Sigma) and vortexing in 1 ml lysis solution containing β-mercaptoethanol (Qiagen, RNeasy kit). Cell lysates were stored at −80°C.
RNAs were extracted using RNeasy mini kits with on-column DNAse I digestion (Qiagen). RNA samples were quantified by spectrophotometry, and the quality was assessed using an Agilent 2100 BioAnalyzer with the RNA 6000 Nano LabChip. All RNAs had a 260/280 nm ratio between 1.8–2, and a 28s/18s rRNA ratio above 2.
cDNAs were synthesized from total RNA using SuperScriptII reverse transcriptase (Invitrogen Corporation) and a poly d(T) primer. For each cell line, 5 µg of total RNA in a total volume of 20 µl were used, and the resulting cDNA was diluted 1:14 prior to PCR.
Gene selection and assay design
We chose to study a 25 Mb region of Hsa21 encompassing both the so-called DS critical (30) region and the minimal region thought to be involved in the development of the heart defect phenotypes (atrial or ventricular septal defects or complete atrioventricular canal) present in ∼40% of DS patients (31). This region contains around 109 well-characterized genes.
On the basis of experimental data indicating expression in lymphoblastoid cell lines (www.ensembl.org/Multi/martview?species=Homo_sapiens), we selected 71 genes for which we designed Taqman assays. Assay designs are listed in Supplementary Material, Table S1.
Real-time quantitative PCR
Real-time quantitative PCR was carried out essentially as described in Lyle et.al (10). Briefly, intron-spanning Taqman assays were designed using PrimerExpress (Applied Biosystems) with default parameters. Assay efficiencies were calculated using a cDNA dilution series (44). All PCRs were performed using a qPCR mastermix (RT-QP2X-03, Eurogentec).
Reactions were set up using a Biomek 2000 robot (Beckman), in a 10 µl volume in 384-well plates. Six replicates per gene per sample were performed. PCRs were run in an ABI 7900 Sequence Detection System (Applied Biosystems) with the following conditions: 50°C for 2 min, 95°C for 10 min and 50 cycles of 95°C 15 s/60°C for 1 min.
To maximize the reproducibility of our results, we premixed the required amount of cDNA, with a large volume of PCR mastermix, to ensure uniformity in the starting concentration of cDNA in all assays and replicates. In addition, the experiment was designed so that all assays for a particular individual were run at the same time on a single plate, as this renders the normalizations more accurate.
We checked that all CT-values were within the ranges tested in assay efficiency tests. In addition, no systematic biases were detected in the results as determined by assessing correlation between the following variables: PCR efficiency (E), threshold cycle (CT), relative expression and error (10).
qPCR data analysis
Raw cycle threshold (CT) values were obtained using SDS 2.0 software (Applied Biosystems). A threshold value of 0.2 was used for all genes, and background was changed manually for individual genes as recommended by Applied Biosystems. From the six replicates per gene, outliers were detected using Grubb's test at the 95% significance level. For all calculations, CT-values were converted to quantity (q) with the formula q=2−CT. Analyses were carried out in Excel, SPSS and Minitab (MINITAB Inc.).
We normalized expression values by performing a median normalization across all genes, excluding the 75th percentile to minimize the effect of outliers. Each gene was then median normalized across individuals. Normalized relative expression values thus have a median of 1.
To determine which genes show significant levels of expression variation in our cohort, we measured expression levels in a group of 40 unrelated individuals. We evaluated the levels of gene expression variation by means of the variance ratio obtained by dividing the inter-individual variance of the means of each gene by the mean intra-individual error. We selected 25 genes which had a variance ratio above 1.3.
Heritability, linkage, simulations and association
Heritability calculations were performed using the ‘polygenic-screen’ command from the SOLAR software (45). To calculate the empirical significance of the heritability values, we performed a multiple permutation test in which the phenotypic values for all traits were randomly permuted 500 times and heritabilities calculated as before. These results were used to determine the 95% significance thresholds.
SNP genotyping data, consisting of 2688 autosomal SNPs with an effective resolution of 3.9 cM, were downloaded from the SNP Consortium database (http://snp.cshl.org/linkage_maps/) (38). Multipoint linkage with the SNP map was performed using Merlin (46) with the –VC option, after Mendelian inconsistencies (PEDCHECK) (22) and unlikely genotypes (PEDWIPE) (46) were removed.
To calculate the empirical significance of the linkage results, we performed 1000 simulations for each quantitative trait using the –simulate command from Merlin with different seed numbers. We extracted the highest result from each simulation to build significance distributions. The distribution of the maximum scores of permutations followed a chi-squared distribution.
To evaluate the effect of non-normality on the power and levels of type 1 errors in our data, we transformed all traits by: (i) Box–Cox transformation using Minitab, (ii) the bivariate normal copula as described by Basrak et al. (27). We performed 1000 simulations for each transformed trait and built significance distributions as described above. All simulations were performed using a cluster of 32 HP/Intel Itanium 2 based servers at the Vital-IT Center.
Association of CCT8 expression and cis-nucleotide variation was done using anovasnp (version 0.7) (D. Posada, manuscript in preparation) with 100 000 permutations to correct for multiple hypotheses using a step-down procedure (29). Genotypes were downloaded from the HapMap project URL (http://www.hapmap.org/cgi-perl/gbrowse/gbrowse/hapmap), HapMap public release no. 16c.1.
Supplementary Material is available at HMG Online.
We thank Alexandre Reymond, Bernard Conrad and Jacques Beckmann for useful discussions and critical reading of the manuscript. Denis Cohen for bioinformatics help, and Dorret Boomsma and Bojan Basrak for providing the scripts for copula transformation. This work was funded by the Foundation Jérôme Lejeune, the ChildCare Foundation, The Swiss National Science Foundation (3100-057149.99/1), the EU/OFES (QLG1-CT-2002-00816) and the NCCR Frontiers in Genetics.
Conflict of Interest statement. Authors have not declared any conflict of interest.
These authors equally contributed to this study.
Present address: EPAM, Norwegian Institute of Public Health, Oslo, Norway.
|Gene||Variance ratio||Fold expression difference|
|Gene||Variance ratio||Fold expression difference|
|Gene||eQTL(s) position||SNP multipoint P-values (Merlin)|
|TMEM1||12q24.33, 11q24.2, 4q26.2||7.46×10−8, 4.24×10−6, 2.47×10−5|
|Gene||eQTL(s) position||SNP multipoint P-values (Merlin)|
|TMEM1||12q24.33, 11q24.2, 4q26.2||7.46×10−8, 4.24×10−6, 2.47×10−5|
Values in bold are significant on a genomewide basis as determined by simulation studies.