Even though autoimmune diseases are heterogeneous, believed to result from the interaction between genetic and environmental components, patients with these disorders exhibit reproducible patterns of gene expression in their peripheral blood mononuclear cells. A portion of this gene expression profile is a property of familial resemblance rather than autoimmune disease. Here, we wanted to identify the portion of this gene expression profile that is independent of familial resemblance and determine whether it is a product of disease duration, disease onset or other factors. By employing supervised clustering algorithms, we identified 100 genes whose expression profiles are shared in individuals with various autoimmune diseases but are not shared by unaffected family members of individuals with autoimmune disease or by controls. Individuals with early disease (1 year after onset) and established disease (10 years after onset) exhibit a near-identical expression pattern, suggesting that this unique profile is a product of disease onset rather than disease duration.
Autoimmune diseases are heterogeneous diseases believed to arise from immune-mediated attack against self-antigens. Both genetic and environmental factors play important roles in their onset and pathogenesis (1–3). Epidemiologic data along with genetic linkage studies clearly support the presence of a genetic contribution to susceptibility to autoimmune disease (4–12). Linkage studies have demonstrated the presence of susceptibility loci that are shared among multiple autoimmune diseases (4–6) and those that are unique for a given autoimmune disease (8–10).
Specific gene expression profiles have also been found in the peripheral blood mononuclear cells (PBMC) of individuals with autoimmune diseases (13–23). Some of these, for example, an ‘IFN-signature’, have been found in systemic lupus erythematosus (SLE), which is a function of disease severity (18), and an early disease signature has been found in rheumatoid arthritis (RA) (23), seem to be unique to a given type of autoimmune disease. In addition, we have described a gene expression signature that is shared among several autoimmune diseases, including RA, SLE, type 1 diabetes (IDDM) and multiple sclerosis (MS) (13). Thus, gene expression signatures that are both common to several autoimmune diseases and unique to a given property of an individual autoimmune disease exist.
Studies on the genetics of gene expression have been performed in model systems. The studies demonstrate that a portion of the gene expression profile or transcript level in a given tissue is an inherited trait (24–26). Thus, transcript levels are heritable phenotypes and this has been demonstrated under a variety of conditions across a range of species (27). A power of this approach is that linkage analysis can identify loci that control variation in transcript levels. Knowledge of the genes that encode these transcripts will undoubtedly stimulate identification of ‘candidate genes’ that reside in loci identified by linkage analysis. Along these lines, we have asked whether unaffected first-degree relatives of individuals with autoimmune disease share gene expression profiles observed in individuals with autoimmune disease. We find that these two groups share an overlapping gene expression profile (28). The genes that make up this overlapping profile are among the same genes that are differentially expressed in a number of different autoimmune diseases. This argues that a portion of the gene expression profile observed in autoimmune disease is a family trait that may arise through genetic mechanisms.
Therefore, we wanted to determine whether we could identify exclusive gene expression signatures shared by individuals with autoimmune disease but not by unaffected family members (refer to unaffected family members of individuals with autoimmune diseases, hereafter) or by control individuals. We also wanted to determine whether this gene expression signature was a function of onset of autoimmune disease or duration of autoimmune disease.
Supervised gene expression profiling permits separation of unaffected family members from individuals with autoimmune diseases
Previously, we reported that individuals with different autoimmune diseases share a common gene expression profile and that unaffected first-degree relatives of individuals with autoimmune disease share a portion of this profile (13,28). Here, we wanted to determine whether we could identify a second set of genes whose expression profiles permit separation of individuals with autoimmune disease from unaffected family members. To do so, we increased the number of autoimmune samples from eight to 54 and used number of controls (n=8) and unaffected family members (n=8), similar to our previous analysis. Of the 4133 genes for which we had expression data, 752 genes passed filtering condition at a standard deviation of 1, using the Cluster software. After filtering, expression data for the 752 genes were normalized and loaded into the TIGR MEV software. We performed a significant analysis of microarray (SAM) study using these 752 genes. For the first step, we set initial classifications: control individuals group, unaffected family members group and autoimmune individuals group. As the false discovery rate (FDR, the proportion of genes likely to be identified by chance) was set to 0.00000 (FDR<10−5), statistically, the number of genes identified by chance is <1, SAM-identified 100 genes whose expression levels separated the samples into different groups according to their initial classification. We applied hierarchical clustering (HCL) algorithm to the gene expression profile of these 100 genes (Fig. 1). The unaffected family members were grouped into a separate branch from the individuals with autoimmune disease, but within the same branch as the controls. The only exception was that control 21 clustered with two autoimmune patients (RA08 and SLE05), and this small cluster was grouped together with the autoimmune disease group. Two major patterns of gene expression were seen. One pattern contained genes that were highly expressed in controls and unaffected family members and weakly expressed in individuals with autoimmune disease. The second pattern contained genes that exhibited the opposite expression pattern, weak expression in controls and unaffected family members and high expression in individuals with autoimmune disease.
Support tree analysis of the reliability of HCL
By increasing the sample pool size and with the assistance of supervised SAM analysis, we successfully separated the unaffected family members from individuals with autoimmune disease. In order to test the reliability of the HCL results we obtained after SAM analysis, we performed Jackknife re-sampling 1000 times. For each re-sampling process, HCL was performed and the result was compared with the original clustering result. The percentage of the original clustering results that occurred during the 1000 re-samplings indicates the level of reliability or support of the clustering result (Fig. 1). The color of the nodes denotes the support level of the clustering, i.e. the frequency this clustering node appeared among the 1000 re-sampling processes. By performing support tree (ST) analysis, we found that the branch containing the unaffected family members and control individuals was separated from the branch of individuals with autoimmune disease and achieved >90% support by the Jackknife re-sampling test. Therefore, we concluded that the separation between unaffected family members, control individuals and individuals with autoimmune diseases was very reliable.
Principal components analysis
From the above analysis, the 100 genes identified by SAM analysis permitted reliable separation among control individuals, unaffected family members and individuals with autoimmune diseases. In order to gain an intuitive overview of how these three groups were distributed in three-dimensional space according to expression profiles of these 100 genes, we performed principal components analysis (PCA). PCA projected these three groups of individual samples (individuals with autoimmune diseases, healthy controls and unaffected family members) into a three-dimensional space and their positions in this space were determined by their gene expression profile. PCA clearly separated the samples into three distinct groups in space: unaffected family members (red spheres), healthy controls (yellow spheres) and individuals with autoimmune disease (other color spheres) (Fig. 2A, three-dimensional space; B, two-dimensional space). This analysis demonstrated two points. First, as individual samples, the control samples were more like other control samples than the non-control samples, the unaffected family member samples were more like other unaffected family samples than the other samples and the autoimmune samples were more like other autoimmune samples than the non-autoimmune samples. Secondly, as a group, the unaffected family member samples were more like the control samples than they were like the autoimmune samples. This result essentially confirmed our HCL results and demonstrated that the separation pattern may be determined by the similarity of the different groups.
Gene distance matrix
In order to further examine the similarity of these three groups, controls, unaffected family members and individuals with autoimmune disease, we examined the similarity among the three groups by calculating the distance metric (Euclidean distance, see Statistical Analysis) between the samples from each of two different groups (unaffected family members versus controls, unaffected family members versus individuals with autoimmune diseases and controls versus individuals with autoimmune diseases). The result is shown in the gene distance matrix (GDM) (Fig. 3). Each square element within the matrix is rendered a color that represents the distance or similarity between the two samples associated with the element. Color similarity indicates that two samples have a high degree of actual similarity. Thus, samples with dark blue colors indicate that the corresponding two samples have a higher degree of similarity than the samples denoted by pale blue colors. These data indicate that the expression profiles of the 100 genes identified by SAM analysis have a higher overall similarity between the unaffected family members and the controls than that between the unaffected family members and the individuals with autoimmune diseases or the controls and the individuals with autoimmune diseases. This result is consistent with the PCA analysis (Fig. 2) and can be considered as the underlying rationale for the distribution pattern of these samples in the PCA space.
Composition of SAM-identified 100 genes
In our previous study, we identified a shared gene expression signature in individuals with different autoimmune diseases (13). We examined the 100 genes identified by SAM. We found that 64% (64/100) of these genes represent previously identified autoimmune signature genes (13). We also examined the distribution of the autoimmune signature genes in the total 752 genes data set (110/752) identified by the initial filtering. The χ2 test revealed that the autoimmune signature genes were significantly over-represented in the set of genes identified by SAM (χ2=132.39, P<0.001).
In order to gain an overview of the potential biological relevance of alterations in the expression of the 100 genes identified by SAM, we performed EASE analysis to categorize the 100 genes identified by SAM into their biological ontology (see Materials and Methods). We considered categories as significant if they contained at least two genes with P<0.05 (Fisher's exact test with Bonferroni correction). The over-represented categories of the 100 genes identified by SAM included ‘RNA splicing’, ‘DNA damage response genes’, ‘amino acid catabolism’, ‘regulation of CDK activity’ and ‘response to toxin’ (Table 1). These results are consistent with the idea that lymphocytes from individuals with autoimmune disease exhibit defects in responses to DNA damage and cell cycle control, as we have shown previously, as well as defects in other processes (13,29)
Quantitative differences in gene transcript levels among the three groups
From the above analysis we found that the 100 genes identified by SAM were over-represented by autoimmune signature genes and that ‘DNA damage response genes’ and ‘regulation of CDK activity’ were over-represented biological categories. Next, we wanted to examine the quantitative variation in expression levels of the 100 genes identified by SAM in the different groups. We divided the 100 genes into two groups: ‘autoimmune signature’ genes and non-‘autoimmune signature’ genes according to our previous studies (13). Comparison of expression levels of representative autoimmune signature genes demonstrated that these genes were uniformly expressed at similar levels in controls and unaffected family members and were expressed at considerably lower levels in autoimmune individuals (Fig. 4A). The non-autoimmune signature genes displayed a more heterogeneous expression pattern (Fig. 4B). However, in most cases, expression levels of individual genes were higher in unaffected family members than in the controls or the autoimmune samples.
Expression levels of autoimmune signature genes in individuals with early RA
We have previously analyzed two populations of patients with RA, one population with an average disease duration of 10.5±2.6 years and the other with an average disease duration of 1.1±0.3 years (23). Therefore, we compared the gene expression profiles between the early RA (ERA) and the established RA to determine whether each group exhibited the same expression patterns of the 100 genes identified by SAM. We found that these genes were highly expressed in controls and unaffected family members but weakly expressed in both ERA and RA samples with similar expression patterns seen in the two RA groups (Fig. 5). We conclude from this comparison that expression levels of these genes change early after the onset of RA and remain relatively constant thereafter. We presume that gene expression patterns in the other autoimmune diseases follow a similar pattern.
In order to validate our results, we arbitrarily divided our samples into two sets, a training set and a test set; each set contained four controls, four unaffected family members and 27 individuals with autoimmune disease. We applied the same analysis described earlier to the training set: of the 4133 genes for which we had expression data for the training set, 722 passed filtering conditions at an SD of 1. We performed SAM using these 722 genes; as FDR equals to 0.00000, SAM-identified 99 genes whose expression data accurately separated the three groups (control, unaffected family and autoimmune disease) with >90% support by ST HCL analysis (Fig. 6A). Next, we determined whether the gene expression profile of these 99 genes could be used to discriminate the three groups in the test set. To do so, we applied ST HCL to the test set, using the expression levels of the 99 genes identified by the training set analysis (Fig. 6B). The autoimmune disease group was separated from the control and unaffected family member groups with >90% support, except control 21, which clustered with RA08 in the autoimmune groups. Under the node shared by the control and unaffected family groups, the control group was clustered together and separated from the unaffected family member group. Next, we used the gene expression profile of the 99 genes to perform ST HCL with the total sample pool (training set and test set). Under these conditions, individuals with autoimmune disease clustered together with >90% support and samples from the training and test sets were distributed with each other under the autoimmune disease group node (Fig. 6C). Similarly, the control group was clustered together and separated from the unaffected family member group, except for control 21, which was clustered with SLE05 and RA08 in the autoimmune group. Within each group, samples from training and test sets were intermingled with each other. Strikingly, 80 of the 100 genes identified by SAM, using the entire data set, were also identified from the training set only. EASE analysis of the 99 genes identified by SAM in the training set revealed the same over-represented functional categories (P<0.05, Fisher's exact test, Bonferroni correction applied), as identified from the previous 100 genes (Table 1). These results indicate that the gene expression signature identified by these methods will discriminate individuals with autoimmune disease from unaffected individuals, both family members and non-family members of affected individuals.
Previously, we characterized a single common gene expression profile present in the PBMC of individuals with four different autoimmune diseases, RA, SLE, MS and IDDM (13). A portion of this gene expression profile is also present in unaffected first-degree relatives of individuals with autoimmune disease (28). Our interpretation of these results is that a portion of these variations in gene transcript levels is associated with familial resemblance rather than clinical manifestation of autoimmune disease, suggesting that it may be of genetic origin. In this article, we have tried to identify a set of genes whose expression profile exclusively reflected the presence of autoimmune disease without the influence of familial resemblance. To do this, we pre-classified control individuals (n=8), unaffected family members (n=8) and individuals with autoimmune disease (n=54) as three independent groups. We reasoned that increasing the sample size of the autoimmune class might help discriminate between the autoimmune class and the control and unaffected family member classes. With the assistance of supervised SAM, we successfully identified 100 genes whose expression profiles can discriminate individuals with autoimmune disease from unaffected family members and controls, using unsupervised HCL algorithms. By validation analysis, we confirmed that the gene expression signatures identified by this method can accurately discriminate individuals with autoimmune disease from unaffected individuals who are either family members or non-family members of affected individuals.
PCA is another measure that provides an overview of gene expression data. PCA essentially confirms the HCL results. This analysis segregates samples into three groups: control individuals, unaffected family members and individuals with autoimmune disease, on the basis of their distribution in three- or two-dimensional space. Samples from control individuals and unaffected family members are also grouped in closer proximity to one another than are individuals with autoimmune disease and unaffected family members or controls. The similarity among different groups measured in GDM also demonstrates the similarity of gene expression profiles among different groups that determines their distribution in the PCA three-dimensional space.
The 100 genes identified by SAM consist largely of previously identified autoimmune signature genes (13) and their expression levels are much lower in individuals with autoimmune disease than in control individuals or unaffected family members. EASE analysis reveals that one of the most over-represented biological process categories in the 100 genes identified by SAM is the ‘DNA damage response’ category, which includes both TP53 and BRCA1 genes. The p53 protein, which is also under-expressed in RA lymphocytes, is a central mediator of cellular responses to stress able to induce either cell cycle arrest or apoptosis, depending upon the degree of stress or damage (30,31). Defects in lymphocyte apoptosis may contribute to the development of autoimmunity. For example, the MRL murine strain has a mutation in Fas and exhibits defects in apoptosis and develops an autoimmune-like syndrome (32). Studies in murine models of collagen-induced arthritis demonstrate that loss of p53 function contributes to more severe lymphocytic infiltration into tissues and more severe joints destruction (33). In addition, both viral and cellular factors can interfere with p53 expression and function (34), and the consequences may affect the normal expression of p53 as well as p53 effector genes, Gadd45a or p21, and this may also contribute to the development of autoimmunity (35,36).
In this analysis, we have RA patients with early disease and established disease. Expression levels of the 100 genes identified by SAM are highly under-expressed in both groups when compared with that of the control and unaffected family member groups and exhibit similar expression patterns in the two RA groups. We conclude from this comparison that changes in gene transcript levels must occur at the time of disease onset or very shortly after disease onset. Although we do not know the precise mechanism that causes these changes in gene expression after disease onset, it may arise from cell intrinsic or extrinsic mechanisms. If one considers how this may occur in T lymphocytes, intrinsic changes may arise from alterations in lymphoid progenitor cells, selection processes during development in the thymus or perhaps the establishment of a chronic infection in progenitor or peripheral T cells. Extrinsic mechanisms may result from alterations in the host environment, such as changes in the cytokine or chemokine milieu that lymphocytes face in the periphery or changes in other cell types that interact with lymphocytes. We believe that this will be a fruitful avenue of future investigation.
Gene expression signatures in autoimmune disease have been widely described (13–22). Differential expression of these genes has the potential to affect both the onset and the pathogenesis of autoimmune disease. However, because of the complex genetic characteristics of autoimmune disease, the high variability of the genetic character of the human population and the contribution of genetics to gene expression (24–27), the differential expression pattern of these genes cannot be attributed only to the presence of autoimmune disease. It may also reflect familial resemblance. Both components should be considered in any study of gene expression profiling in human disease. This may help us more accurately identify changes in gene expression that are strictly associated with disease without masking of the genetic components. However, it must be considered that the gene expression pattern that is strictly associated with disease onset may not arise without the contribution of the gene expression pattern that is governed by genetics or familial resemblance.
MATERIALS AND METHODS
This study consists of the following groups as described in the text. All autoimmune patients satisfied established criteria for diagnosis of their respective diseases. Human subject studies were approved by the committee for the protection of human subjects of the Vanderbilt University Institutional Review Board.
Eight healthy control individuals without active infection or family history of autoimmunity.
Fifty-four individuals with autoimmune diseases: SLE (n=19), RA (average disease duration of 10.5±2.6 years, n=9), ERA (average disease duration of 1.1± 0.3 years, n=17), IDDM (insulin dependent diabetes mellitus, n=5) and MS (n=4).
Eight unaffected family members of individuals with autoimmune diseases (SLE and RA).
Sample preparation and microarray procedures
Analysis procedures presented here comply with MIAME (minimal information about a microarray experiment) guidelines established by the Microarray Gene Expression Data Society (www.mged.org). PBMC were isolated from 20 ml heparinized blood on a Ficoll–Hypaque gradient. All samples were processed within 1 h of blood collection. Total RNA was isolated with Tri-Reagent (Molecular Research Center, Inc., Cincinnati, OH, USA) and 5 µg RNA was reverse-transcribed by reverse transcriptase (Superscript II, Invitrogen Corporation, Carlsbad, CA, USA) in the presence of 33P-dCTP. Labeled probes were purified using a Bio-Spin 6 Chromatography Column (Bio-Rad Laboratories, Inc., Hercules, CA, USA). Before hybridization, GeneFilters membranes (GF-211, Research Genetics/Invitrogen Corporation, Carlsbad, CA, USA) were washed in boiled 0.5% SDS, saturated with 5.0 ml Microhyb solution (HYB125.GF, Research Genetics/Invitrogen Corporation). Filters were treated with pre-hybridization reagents (5.0 µg Human Cot-1 DNA and 5.0 µg Poly dA, Invitrogen Corporation) in a hybridization roller tube (Midwest Scientific, St Louis, MO, USA) for 2 h at 42°C. Purified, labeled probes were denatured and added to roller bottles containing filters and pre-hybridization solution. GeneFilters membranes were hybridized overnight at 42°C. After hybridization, membranes were washed (1st and 2nd wash: 2X SSC, 1% SDS at 50°C for 20 min; 3rd wash: 0.5X SSC, 1% SDS at 55°C for 15 min). After washing, membranes were exposed to imaging screens for 24 h and the screens were scanned by a phosphorimager (Molecular Dynamics/Amersham Biosciences, Piscataway, NJ, USA). Acquired images by the phosphorimager were loaded into Pathways 4.0 software (Research Genetics/Invitrogen Corporation). The relative intensity of each spot on the membrane was determined and the microarray data set was subjected to further analysis, using the different analytical platforms. Microarray data have been deposited into GEO database, accession no. GSE3997.
Raw microarray data were imported into Cluster (version 3.0) (37) for pre-processing. For the analysis, we filtered the data by selecting those genes whose SD of expression levels across all samples exceeded 1.0. Filtered data were normalized and loaded into the TIGR MultiExperiment Viewer (MEV version 3.1) (38) from the Institute of Genomic Research (Rockville, MD, USA). The following data analysis modules of MEV were used for the analysis: HCL, SAM, ST clustering, PCA and GDM.
Significant analysis of microarray.
Microarray data were statistically analyzed using the SAM algorithm, which was specifically developed for the analysis of genome-wide expression data (39). Briefly, SAM uses the SD of repeated gene expression measurements to assign a score to each gene. It estimates an FDR by permutation of the data for a particular score. SAM analysis ascertains that genes identified as differentially expressed do not arise from a random fluctuation of the large quantity of data generated (39).
FDR was used to evaluate the statistical significance of the gene set identified by SAM. It is defined as the expected proportion of falsely identified genes (V) among the total number of identified significant genes (R) (FDR=V/R, V and R are calculated by the program). In this study, the FDR=0.00000.
ST was used to identify hierarchical trees and to show the statistical support for the nodes of the trees, on the basis of Jackknife re-sampling of the data.
Principal components analysis.
PCA is used to identify the key variables (or combination of variables), principal components, that represent a multi-dimensional data set (for example, a number of gene expression variables in a number of samples). The three most representative principal components are used to map each element to a three-dimensional viewer (40).
Gene distance matrix.
We used the GDM to calculate the distance metric (Euclidean distance) between two samples to provide an intuitive and comprehensive view of the distance (or similarity) between any two samples. These distances are depicted as a colored matrix representing all sample-to-sample distances. The intensity of the color represents the distance or similarity between the two samples.
Detailed descriptions of the applications of these modules are provided in the Results section and are also available in the TIGR MEV manual (38).
Biological process categorization by gene ontology
The Expression Analysis Systemic Explorer (EASE version 2.0) was used to search for common biological themes within gene lists generated by our microarray analysis. EASE assigns identified genes to ‘GO: Biological Process’ categories of the Gene Ontology Consortium (www.geneontology.org) (41–43) and categories are tested statistically (EASE: Fisher's exact test) to identify over-represented categories of identified genes within the biological process system. Significant functional categories are those with the number of list hits (LH) of at least 2, with a P-value<0.05 (EASE, Fisher's exact test with Bonferroni correction).
In our study, Euclidean distance was used to build the HCL (unsupervised, SAM supervised and ST). It was also used to calculate the distance of groups in the GDM (38). Euclidean distance was used to calculate the distance metric that reflects the distance between two objects in space. Euclidean distance extends to as many dimensions as present in the expression vectors to be compared. In our case, the expression vectors are represented by the values of expression of an individual gene over the range of samples (or, in GDM analysis of this study, the expression vectors are represented by the values of expression over the range of identified genes for an individual sample). The calculation of Euclidean distance is
Jackknifing takes each expression vector and randomly omits a sample. This method produces expression vectors that have one fewer sample and this is often done to minimize the effect of single outlier values (38,44).
We thank the health-care providers and patients who contributed to these studies for their help and cooperation. This work was supported by a grant from the National Institutes of Health (R42 AI53984).
Expression of SAM-identified 100 genes of all the samples is provided as an excel file, SAM100.xls, and expression of SAM-identified 99 genes from training set is provided as an excel file, SAM99.xls.
Conflict of Interest statement. None declared.
|RNA splicing||3||0.042||SF3A3; SIP1; SNRP70|
|DNA damage response||2||0.042||BRCA1; TP53|
|Amino acid catabolism||3||0.042||ASL; ASPA; BDH|
|Regulation of CDK activity||2||0.042||CDKN1B; CKS2|
|Response to toxin||2||0.042||BPHL; EPHX2|
|RNA splicing||3||0.042||SF3A3; SIP1; SNRP70|
|DNA damage response||2||0.042||BRCA1; TP53|
|Amino acid catabolism||3||0.042||ASL; ASPA; BDH|
|Regulation of CDK activity||2||0.042||CDKN1B; CKS2|
|Response to toxin||2||0.042||BPHL; EPHX2|
Biological process categories significantly over-represented (P<0.05; Fisher's exact test, Bonferroni correction applied) in 100 genes identified by SAM. LH: the number of identified genes within the specific category; EASE: P-value, Fisher's exact test, Bonferroni correction applied.