In the last decade, the search for the genetic origins of phenotypic variation has expanded beyond the non-synonymous variants which alter the amino acid sequence of the encoded protein, and many examples of sequence variants which alter gene expression have been found. Recently, using both traditional and novel technologies, a number of surveys have been carried out to examine the frequency with which cis-acting sequence variants or other cis-acting effects alter gene expression either in vitro or in vivo. Microarray data have shown that the expression of many genes varies markedly between individuals, and allele-specific expression studies have shown that the source of much of this variation appears to be cis-acting effects. A significant proportion of the variation may originate in gene promoter regions, and a large number of sequence variants which have a functional effect in vitro have been found. The evidence suggests that, given a large enough population, most, if not all, genes may have allele-specific expression differences in at least some individuals. Finding the genetic origins of each of these differences and linking each to a possible phenotype must be a major long-term goal of the biomedical community.
Sequence variations in protein encoding genes may influence phenotype and, at a conceptual level, this occurs in two ways: changes to the quality or changes to the quantity of the encoded protein. Those changes are transmitted from the gene to the protein via mRNA, represented by changes to the sequence or the abundance of the encoding mRNA. The unexpectedly small number of protein encoding genes in the human genome has led to the conclusion that a large proportion of inherited human phenotypic variation, including variation in response to the environment, is influenced more by the quantity than by the quality of the encoded proteins. However, the regulatory regions of genes are largely uncharacterized both functionally and structurally.
cis-ACTING REGULATORY SEQUENCE VARIANTS CAN CAUSE HUMAN DISEASE
It has long been known that mutations in non-coding regions which affect gene expression can cause human genetic disease. Classic examples include variants of the thalassaemias (1), hypercholesterolemia (2) and, at a slightly more complex level, Fragile X syndrome. However, at present, the majority of known functional DNA variants that have been related to disease affect protein structure (3). The Cardiff Human Genome Mutation Database (http://archive.uwcm.ac.uk/uwcm/mg/docs/hohoho.html) currently lists 26 535 mis-sense/nonsense mutations compared to 545 regulatory mutations (May 2004). However, the dominance of the former group in disease causation is probably at least partly the result of bias, as such variants are simpler to identify than those that affect expression. This is because their location is generally predictable (within or around coding sequence) and they are also generally easier to recognize as variants that alter mRNA coding. Alternatively, the discrepancy may be due to regulatory polymorphisms having smaller phenotypic effects such that, in isolation, they are not sufficient to cause disease.
Candidate gene analyses of complex phenotypes have also tended to be geared towards the analysis of coding variants. However, this may be the wrong approach in complex phenotypes for which more subtle changes have been predicted to dominate (4,5). The potential importance of regulatory variants for complex diseases is underscored by reports that have already implicated such polymorphisms in susceptibility to complex diseases such as autoimmune disease (6), rheumatoid arthritis (7), myocardial infarction and stroke (8), diabetes (9), inflammatory bowel disease (10) and schizophrenia (11,12) and in other diseases such as asthma in which the pathogenic spectrum includes, but is not dominated by, such changes (13–15).
The very nature of complex phenotypes suggests that the effect of any one gene variant in isolation is not great enough to generate a phenotype. Therefore, associating sequence variation with complex phenotypes is often difficult and the field is littered with unreplicated findings. The detection and recognition of variants that affect expression pose formidable problems because the genomic organization of the regulatory sequences for any given gene is generally unknown. Such variants may be located within the transcript itself, introns, the immediate 5′ flanking region (promoter and adjacent sequences) or within enhancer and silencer elements that can be several tens or hundreds of kilobase pairs upstream or downstream from the transcribed sequence.
DIFFERENCES IN GENE EXPRESSION LEVELS ARE COMMON
If altered gene expression due to sequence variation in regulatory regions is a common primary mechanism for phenotypic variation and disease, it implies that:
Variation in gene expression occurs and is common and, regardless of the mechanisms underlying the variation, this should be detectable as differences in expression levels between individuals.
The expression of a substantial proportion of genes is influenced by polymorphism in regulatory elements.
A number of studies have examined the differences between tissues subject to various stimuli and diseases or other phenotypes (16). However, only recently has variation of gene expression between individuals without a disease state been examined. Several studies have been reported, each of which used microarrays.
A study of brain and liver mRNA from three individuals showed that global differences between them were as great as differences between humans and chimpanzees (17). A larger study surveyed variation in gene expression patterns in peripheral blood lymphocytes from 75 healthy individuals (18). Approximately 18 000 genes were studied and 370 were found to have a 2-fold or greater variation from the mean in at least five individuals. Although the requirement for five individuals showing differences in expression is conservative, the number of variably expressed genes (2%) is low.
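The selection criterion used in the lymphocyte survey (18) can be illustrated with a minimal sketch. The function below flags a gene when at least five individuals differ from the mean expression level by 2-fold or more. The function name, the use of linear rather than log-transformed values, the counting of deviations in both directions and the toy data are all assumptions made for illustration; they are not taken from the study itself.

```python
from statistics import mean


def is_variably_expressed(levels, fold=2.0, min_individuals=5):
    """Return True if at least `min_individuals` values in `levels` differ
    from the mean of `levels` by `fold` or more, in either direction.

    `levels` holds one (linear-scale) expression value per individual.
    """
    m = mean(levels)
    if m <= 0:
        raise ValueError("expression levels must have a positive mean")
    # An individual counts if it is at least `fold` times above the mean
    # or at least `fold` times below it.
    deviating = [x for x in levels if x >= fold * m or x <= m / fold]
    return len(deviating) >= min_individuals


# Hypothetical toy data: one value per individual for two genes.
stable_gene = [100, 110, 95, 105, 90, 100, 98, 102, 101, 99]
variable_gene = [10, 12, 11, 9, 10, 300, 8, 11, 10, 12]

print(is_variably_expressed(stable_gene))
print(is_variably_expressed(variable_gene))
```

The second toy gene shows one weakness of a mean-based threshold: a single extreme individual shifts the mean enough that every other individual is counted as deviating, which is one reason the choice of threshold and definition makes the surveys hard to compare.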
In comparison, two studies analyzed lymphoblast RNA derived from CEPH families. The first of these found a wide variation in expression levels of many genes between individuals (19). The variance ratio for 813 genes varied between 0.4 and 64 with a median value of 2.5. The 40 genes with highest variability appeared to be randomly spread across the genome. The heritability of the variability of expression of five genes was investigated by comparing expression levels between monozygotic twins, siblings and unrelated individuals; expression levels of monozygotic twins were similar, whereas in the latter two comparisons the expression levels were 2–5 and 3–11 times more variable, suggesting a genetic influence.
The second study analyzed lymphoblast RNA derived from 56 individuals from four CEPH families and identified 2726 (11%) out of approximately 25 000 genes which were significantly differentially regulated (type I error=0.05) within eight or more of the 16 pedigree founders (20). Of those, 29% had a detectable genetic component. Such a high percentage in such a small number of individuals indicates that variability of gene expression levels has a high heritability.
The complexity of analysis of microarray data, the differing presentation of results and the different thresholds and definitions of differential regulation make a direct comparison of the earlier mentioned studies difficult. However, the studies of CEPH lymphoblasts (19,20) both suggest considerable variation between individuals and also that much of the source of this variation is genetic. In comparison, the study of blood lymphocytes (18) showed relatively few differences between individuals, and in addition, much of the variation arose from age, gender, time of day of sample collection and health status of the individuals. The origin of this discrepancy is not clear; mRNA levels in primary lymphocytes might be expected to be more variable than those of cultured lymphoblasts, owing to state and environmental differences between the donors, whereas the opposite is seen.
ORIGINS OF THE VARIATION
Although changes to the gene sequence which lead to changes to the amino-acid sequence of the protein are, by definition, heritable and ‘genetic’, changes in the mRNA abundance in a specific tissue can be induced by several mechanisms. These include:
Allele-specific variation (cis effects).
Inter-individual variation as a result of polymorphism in other genes that regulate the expression of the target genes (trans effects).
Inter-individual variation in environmental factors that regulate gene expression.
Variation in gene expression as a consequence of phenotypic state (e.g. self-neglect, drugs and nutrition).
In addition, inter-individual comparisons are subject to a number of potential errors including:
Variation in mRNA quality between subjects as a result, for example, of post mortem delay or ante mortem agonal state.
Variation as a result of minor differences in the proportions of different cell types in a block of tissue.
Artefacts resulting from the immortalization of lymphoid cells (when using lymphoblasts).
Both the cis–trans distinction and the intrinsic errors in inter-individual comparisons mentioned earlier can be controlled by the use of assays which use only a single individual.
ALLELE-SPECIFIC GENE EXPRESSION DIFFERENCES IN HUMANS
The detection of allele-specific gene expression in a single individual relies on the ability to distinguish the gene product of one parental chromosome from that of the other, and then to quantitate the relative amounts of each gene product that is produced. Several different methods have been described and two of those methods have been used to survey a number of genes.
Three studies have been carried out using the method of single nucleotide primer extension (21). The principle of this procedure is shown in Figure 1. A transcribed polymorphism is used as a marker to distinguish between the mRNA products of the parental chromosomes. The relative abundance of each allele from a heterozygous individual is then quantitated using RT–PCR and primer extension with radiolabelled (21) or fluorescent nucleotides (22–24). Both gene copies come from the same tissue samples and have been subject to the same environmental influences including genetic trans-acting factors and experimental insults including mRNA degradation. In the absence of either cis-acting sequence variation or epigenetic effects affecting expression of the target mRNA, each chromosome should be equally expressed regardless of the absolute level of gene expression. The ratio of the abundance of each allele is therefore expected to be ∼1. In practice, this ratio varies for experimental reasons, but this can be corrected for by using the source genomic DNA as a control. In samples that are heterozygous for a cis-acting regulatory variant or epigenetic modification, mRNA originating from one chromosome will be expressed at a higher level than that from its homologous chromosome and this is detected by changes in the ratio of abundance of each mRNA allele.
In the first reported survey of allelic discrimination, this method (21) was used to study 13 genes in 96 CEPH lymphoblast samples and variable allelic ratios were found in six genes (22). The variations were found in 18–39% of individuals and the magnitude of variation ranged from 1.3- to 4.3-fold, although only one gene gave variations above a factor of 1.9. Of importance, the results from three families were consistent with inheritance of the variations.
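The use of genomic DNA as a control can be sketched in a few lines. In a heterozygote the two alleles are present in genomic DNA at a true ratio of 1:1, so any departure of the measured genomic ratio from 1 reflects assay bias, and dividing the cDNA ratio by the genomic ratio removes that bias. The function names and the example peak heights below are hypothetical and are given only for illustration; the published studies may have processed their signals differently.

```python
def normalized_allelic_ratio(cdna_a, cdna_b, gdna_a, gdna_b):
    """Ratio of allele A to allele B in cDNA, corrected by the ratio
    measured for the same two alleles in genomic DNA from the same
    heterozygous individual (true genomic ratio assumed to be 1:1).

    Arguments are signal intensities (e.g. primer extension peak heights).
    """
    if min(cdna_a, cdna_b, gdna_a, gdna_b) <= 0:
        raise ValueError("all signal intensities must be positive")
    return (cdna_a / cdna_b) / (gdna_a / gdna_b)


def fold_difference(ratio):
    """Express a ratio as a fold difference >= 1, whichever allele is higher."""
    return ratio if ratio >= 1 else 1 / ratio


# Hypothetical peak heights for one heterozygous sample.
ratio = normalized_allelic_ratio(cdna_a=1500, cdna_b=1000,
                                 gdna_a=1200, gdna_b=1000)
print(round(ratio, 3), round(fold_difference(ratio), 3))
```

In this toy case the raw cDNA ratio of 1.5 falls to a corrected ratio of about 1.25 once the assay bias seen in genomic DNA (1.2) is taken into account.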
A second survey used a similar method to that mentioned earlier to study an additional 15 genes (23). Of importance, the mRNA was extracted from 50 post mortem human brain cortex samples, eliminating the possibility of artefacts caused by the immortalization of lymphocytes. Seven genes displayed allelic expression differences of over 1.2-fold, the largest being 1.7. For the variably expressed genes, between 5 and 66% of individuals showed differences.
A third survey again used a similar method to that mentioned earlier to study 129 genes in lymphoblast mRNA (24). Of those, 23 genes showed relative allelic expression of 1.7-fold or greater (the chosen cut-off point), the largest difference, not including imprinted genes, being 9-fold. For the variably expressed genes, between 11 and 100% of individuals showed differences. At least one of these genes was also shown to have allele-specific expression in adipose tissue.
The similarity in these three studies of the magnitude of changes and the proportion of both the genes and individuals showing variation suggests that the source of the tissue is not critical, although clearly any state effects related to illness and drug treatment could only be detected in primary tissue samples.
Two of the earlier mentioned studies (22,23) quantitated the relative abundance of the extension products using the ABI Prism SNaPshot Multiplex Kit. This method of quantitating the relative ratio of two alleles is based on technology developed for rapid genotyping of pooled samples (25) and is accurate to 3%. This accuracy allows small differences in ratio to be reliably detected. However, the primer extension method is relatively slow. A higher throughput method, allele-specific microarrays, has been used to measure the levels of mRNAs (26). In total, 602 genes were studied and 326 (54%) showed a greater than 2-fold difference, whereas 170 (28%) showed a >4-fold difference. However, this method only allowed the identification of 2-fold or greater differences in allelic expression. The work was carried out using liver and kidney from a relatively small number of human fetuses. For each gene not more than five heterozygotes were studied, and for over 220 (68%) of the genes only one or two heterozygotes were available. Thus, the study had very low power to detect allelic differences, yet the rate of differences found is far higher than that found by others (22–24), and it is not clear whether the higher frequency and magnitude of variation stem from the tissue differences or experimental limitations.
The methods described earlier are either highly accurate but relatively time consuming and expensive (22,23) or have high throughput but with unproven accuracy (26). A number of other methods have been published which may allow higher throughput along with accuracy, although at the time of writing only a few genes have been analyzed.
One limitation of both the methods mentioned earlier is the requirement for a transcribed sequence variant in the individuals to be studied. An alternative method overcomes this limitation: although a marker SNP is still required to differentiate between the two alleles, this marker may, in principle, be anywhere in the gene (although SNPs closer to the start of transcription will give more accurate results) (27). The method discriminates between the two alleles on the basis of differential initiation of transcription from genomic DNA. RNA polymerase II binds to the promoter of a gene and initiates transcription. A requirement for this process is the phosphorylation of RNA pol II at two serine residues including Ser5. For allelic discrimination experiments, proteins are crosslinked to DNA by formaldehyde treatment, the chromatin is fragmented by sonication, and the chromatin bound to the polymerase is immunoprecipitated (ChIP) using an antibody specific to Ser5-phosphorylated RNA pol II. The relative amount of each allele in the immunoprecipitated chromatin was measured using PCR and primer extension, the latter being quantified using mass spectroscopy. One limitation of this technique is that it lacks the internal control for relative amplification and detection of the different alleles; the methods mentioned earlier (23,24) use genomic DNA as a control for the relative peak height obtained from the cDNA. At the time of writing, only a single gene has been analyzed using this method (27).
A modification of the genotyping technique described as ‘digital’ (28,29) has also been shown to have utility for allelic discrimination (30). cDNA is diluted down to the order of one copy per microlitre in a solution which contains polyacrylamide and PCR reagents. A thin layer of material is placed on a microscope slide and polymerized. Following thermal cycling the PCR products are immobilized and can be visualized as spheres within the gel, each sphere representing the product of a single gene copy. In situ primer extension with fluorescently labelled nucleotides was used to discriminate between alleles. This method was used to replicate a previously published analysis of the relative expression of alleles of the gene PKD2 (22) using the same CEPH samples (30).
A sophisticated method of measuring both absolute and relative mRNA levels of specific alleles has been developed (31). Reverse transcribed cDNAs were spiked with an artificial DNA standard, and following competitive PCR the products were measured using mass spectroscopy. The authors claim an accuracy similar to that obtained using other methods (22), and high throughput is clearly possible. However, they have so far reported no data on allelic discrimination.
A purely bioinformatic approach has also been taken (32). The frequency of occurrence of different alleles of 19 312 SNPs in ESTs in public databases obtained from individuals deduced to be heterozygous for the SNP was examined. One hundred and ninety-four SNPs were identified as being differentially expressed. The utility of this method appears to be limited: although two genes known to be imprinted were found in the top 1% of SNPs ordered according to increasing P-value, a further 48 imprinted genes were not identified, suggesting that only a proportion of genes in the order of 4% were examined.
One limitation that all of the methods mentioned have in common is that the source of the difference in allelic expression cannot be directly determined. Although this must in theory be cis-acting, it may be due to a sequence variant anywhere in the region of the gene, or it may be due to epigenetic effects (the transmission of information from a cell or multicellular organism to its descendants without that information being encoded in the nucleotide sequence of the genes). Although the latter are usually considered to be mainly related to X-chromosome inactivation and silencing of a limited number of other genes, a far greater role has been proposed (33), and some findings suggest that altered gene regulation as a result of epigenetic modification of sequence variation might be a common pathogenic mechanism in mammals (34). One possible role for epigenetics is polymorphic imprinting leading to mono-allelic expression, and evidence for this mechanism in the control of the expression of the 5HT2A gene has been presented (35,36). However, using the allelic discrimination techniques mentioned earlier, evidence for this effect could not be found in brain tissue (37), although it was found in lymphoblasts (24). Two of the earlier mentioned surveys (22,23) found no genes which were expressed from only one chromosome, but the third (24) found three such genes including HTR2C. However, all three genes showed non-Mendelian inheritance and the lymphoblast cell lines were shown to be either mono- or oligoclonal, suggesting random mono-allelic expression (24). Overall, the earlier mentioned results are consistent with low numbers of genes being imprinted; however, such an effect may be tissue specific, and partial imprinting may occur.
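The kind of test that underlies such an EST survey can be sketched as follows. If both alleles of a heterozygous individual are equally expressed, each EST carrying the SNP is expected to show either allele with probability 0.5, so the allele counts can be compared with a binomial distribution. The exact two-sided binomial test shown here is an assumption chosen for illustration; the statistical procedure actually used in the published analysis (32) is not described in this review, and the function name and counts are hypothetical.

```python
from math import comb


def binomial_two_sided_p(count_a, count_b, p=0.5):
    """Exact two-sided binomial P-value for observing `count_a` ESTs with
    allele A and `count_b` with allele B, when each EST is expected to
    carry allele A with probability `p`.

    The P-value sums the probabilities of all outcomes that are no more
    likely than the observed one.
    """
    n = count_a + count_b
    if n == 0:
        raise ValueError("at least one EST is required")
    probs = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = probs[count_a]
    # Small relative tolerance so that ties are not lost to rounding.
    total = sum(q for q in probs if q <= observed * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical counts: ten ESTs from one heterozygous library.
print(binomial_two_sided_p(10, 0))  # all ESTs carry the same allele
print(binomial_two_sided_p(5, 5))   # balanced counts
```

With only a handful of ESTs per SNP, even complete mono-allelic expression gives a modest P-value, which is consistent with the limited sensitivity of the approach noted above.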
THE IN VITRO APPROACH
Assuming that the source of variable allelic expression is sequence variants, it is still not possible to identify the causative variant using allelic discrimination. For example, a specific sequence variant in the 3′-UTR of the COMT gene was demonstrated to be associated with variable allelic expression in all heterozygotes (38). However, this does not indicate the variant to be causative, as it may be in linkage disequilibrium with another, possibly unknown sequence variant.
In order to determine the functional effect of any sequence variant on gene expression, it is necessary to control for the effects of any other variant in the genome. Although this may in theory be carried out by studying a large number of individual DNA samples in vivo, all of which have been sequenced to determine the genotype at all polymorphic sites, in practice the method of choice is the in vitro reporter gene assay, not least because it has been widely used and verified.
A large number of sequence variants in the promoter regions of candidate genes have been analyzed using reporter gene assays. A review of such experiments found 107 genes with functional polymorphisms in the 5′ flanking region (39). Of those, 63% had allelic differences of 2-fold or greater in their rates of transcription, and 10% had differences of 10-fold or more. It is not clear how many negative assays have been carried out, as publication bias has almost certainly come into force. However, an extrapolation of the findings suggests that 30% of genes have common functional promoter SNPs (39).
The experiments reviewed earlier (39) were, with few exceptions, carried out on individual genes by research groups studying a limited number of genes. A diverse range of reporter gene systems was used. In comparison, in a recently completed study, the author and colleagues have studied the effects on transcription of sequence variants in the promoter regions of 249 genes (40–46; and unpublished data). From a screening set of 16 individuals, 55 genes (11%) had promoter haplotypes which gave transcriptional activity that differed by 1.5-fold or more from that of other haplotypes. More than half of the effects could be ascribed to less common (<5% minor allele frequency) variants and this suggests that if a larger screening set had been used a greater number of functional variants would have been found. In addition, as with most experiments of this kind, the assays were carried out under basal conditions with no specific stimuli applied. This, coupled with the fact that only two cell lines were used for the majority of the assays, suggests that many tissue and state specific effects were not seen. The figure of 11% is therefore conservative.
Both in vivo and in vitro experiments suggest that allele-specific differences in the rate of transcription are common and that most, if not all, genes are likely to show differential allelic expression in some individuals. Sequence variants rather than epigenetic effects probably underlie most cis-acting effects, although in only a few genes have the effects been shown to be due to a specific sequence variant. In vitro experiments show that sequence variants in gene promoter regions frequently alter rates of transcription and these promoter variants may account for a significant proportion of differential allelic expression.
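The comparison of promoter haplotypes in a reporter gene assay can be sketched as a simple fold calculation. The example assumes that each haplotype has been assayed in replicate and that reporter activity has already been normalized (for instance to a co-transfected control); that normalization step, the function name and the toy values are assumptions for illustration only and do not describe the protocol of the studies cited.

```python
from statistics import mean


def haplotype_fold_range(activities):
    """Compare mean reporter activity between promoter haplotypes.

    `activities` maps a haplotype name to a list of replicate,
    already-normalized reporter activities. Returns the most active
    haplotype, the least active haplotype and the fold difference
    between their mean activities.
    """
    means = {hap: mean(values) for hap, values in activities.items()}
    if min(means.values()) <= 0:
        raise ValueError("mean activities must be positive")
    high = max(means, key=means.get)
    low = min(means, key=means.get)
    return high, low, means[high] / means[low]


# Hypothetical replicate measurements for two haplotypes of one promoter.
assay = {
    "hapA": [1.0, 1.1, 0.9],
    "hapB": [1.6, 1.5, 1.7],
}
high, low, fold = haplotype_fold_range(assay)
print(high, low, round(fold, 2), fold >= 1.5)
```

A fold value alone does not show that a difference is reproducible; in practice a statistical comparison of the replicates would be needed before a haplotype is called functional.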
The role of allele-specific gene expression in complex phenotypes, however, is still not clear as several questions have not been answered for the great majority of the gene expression changes described earlier:
Do the changes in relative allelic rates of transcription (cis effects) lead to overall changes in mRNA levels, or do homeostatic compensatory mechanisms (trans effects) lead to a limited effect?
How do the ‘functional’ SNPs affect rates of transcription?
What percentage change in transcription is required for a phenotypic effect?
Answering these questions and plotting the biological route from sequence variant to phenotypic change for each variant and each phenotype is a major task which will take either some considerable time or new technologies to complete.