## Abstract

We present a multifactorial, multistep approach called genomic convergence that combines gene expression with genomic linkage analysis to identify and prioritize candidate susceptibility genes for Parkinson's disease (PD). To initiate this process, we used serial analysis of gene expression (SAGE) to identify genes expressed in two normal substantia nigras (SN) and adjacent midbrain tissue. This identified over 3700 transcripts, including the three most abundant SAGE tags, which did not correspond to any known genes or ESTs. We developed high-throughput bioinformatics methods to map the genes corresponding to these tags and identified 402 SN genes that lay within five large genomic linkage regions, previously identified in 174 multiplex PD families. These genes represent excellent candidates for PD susceptibility alleles and further genomic convergence and analyses.

## INTRODUCTION

Parkinson's disease (PD; MIM 168600) is a progressive neurodegenerative disorder affecting over half a million people in the USA alone (1). Pathologically, PD is characterized by marked degeneration of the dopaminergic neurons in and deriving from the substantia nigra (SN). Identification of susceptibility genes for PD would greatly improve our understanding of this disorder and hopefully will allow eventual prevention of the disease.

We present here a detailed gene expression profile of the SN and adjacent midbrain tissue, using serial analysis of gene expression (SAGE) from two normal control individuals, aged 72 and 81, and the subsequent genomic localization of all detected transcripts. Two different individuals were analyzed to determine the variability of gene expression in this tissue. SAGE is a powerful open source method for profiling the full complement of transcripts expressed in a given tissue (2). One of the great strengths of SAGE is that it can identify novel genes that have not previously been characterized and, unlike candidate gene approaches, requires no preconceptions concerning disease mechanisms. SAGE is similar in concept to EST sequencing, but, rather than sequencing each transcript individually, a small sequence called a tag is isolated from each transcript and linked together with other tags into long concatemers. These concatemers are then cloned and sequenced, so that tags corresponding to 20–40 transcripts can be sampled in a single sequencing reaction. The gene corresponding to each SAGE tag can be identified by comparison with known cDNA or EST sequences.

While both linkage and expression analyses are powerful on their own, the number of possible genes they present as candidates for PD or any complex disorder remains extremely large. Thus, focusing and prioritizing effort on specific candidate genes is a key to success using these techniques. We have recently completed a genomic linkage analysis of a set of 174 multiplex PD families (3). This analysis generated five distinct linkage intervals that together contain almost 5000 genes that could be candidates for susceptibility genes for PD. Creative approaches must be employed to prioritize such a large number of genes. In mathematics, the process of adding new values or processes to eventually reach one fixed limit or value is called convergence. Thus we report the first steps in a multifactorial approach we term ‘genomic convergence’ that combines expression data, coupled with genomic linkage analysis, to identify and prioritize candidate susceptibility alleles for PD.

## RESULTS

### SAGE libraries

Total RNA was harvested from SN tissue from two normal control individuals aged 72 and 81. SAGE was performed to profile genes transcribed in these tissues. A total of 47 994 tags were sequenced: 23 766 tags from library 542, and 24 228 tags from library 543. These tags represented 15 516 individual transcripts. High GC content indicative of partial denaturation of ditags has been previously reported in some SAGE libraries (4); however, a 44.8% GC content in our libraries indicates that AT-rich tags were retained during library construction. While extracting SAGE tags from raw sequence data, the parameters of the eSAGE software were adjusted so that a tag was recorded only if each base within that tag had a minimum PHRED quality score of 20. Using a set of perl scripts (kindly provided by Bob Lyons), we calculated average PHRED values of 37 (library 542) and 38 (library 543) for the sequences from which SAGE tags were extracted. This allows us to estimate that ∼57 tags from library 542 and 46 tags from library 543 may contain sequencing errors (5). Thus the great majority of the 10 231 tags observed only once in these libraries represent actual transcripts that are expressed at low levels, although Taq polymerase errors during amplification could introduce additional mis-specified tags. This ability to quantify library error rates using sequence quality scores is one of the strengths of the eSAGE software.
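The error estimate above can be approximated from the average PHRED scores with a short calculation. The following is a minimal sketch, not the perl scripts used in this study: it relies on the standard PHRED definition (error probability 10^(-Q/10)), and `bases_per_tag=12` is an assumption chosen because it reproduces the reported estimates; the number of scored bases per tag is not stated in the text.

```python
def phred_error_probability(q: float) -> float:
    """Per-base error probability implied by a PHRED quality score."""
    return 10 ** (-q / 10)


def expected_erroneous_tags(n_tags: int, mean_phred: float, bases_per_tag: int) -> float:
    """Approximate number of tags that contain a miscalled base.

    Uses the small-probability approximation: the chance that a tag
    contains an error is about bases_per_tag * per-base error probability.
    """
    return n_tags * bases_per_tag * phred_error_probability(mean_phred)


# bases_per_tag=12 is an assumption, not a value given in the text.
print(round(expected_erroneous_tags(23766, 37, 12)))  # library 542
print(round(expected_erroneous_tags(24228, 38, 12)))  # library 543
```

Under these assumptions the two calls give approximately 57 and 46 tags, matching the estimates quoted above.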

### Validity of the libraries

We evaluated the validity of our SAGE libraries by searching for genes known to be expressed in the SN. D1 dopamine receptor-interacting protein (CALCYON), dopamine receptor D2 (DRD2), dopamine receptor D4 (DRD4), β-synuclein (SNCB), and γ-synuclein (SNCG) are all expected to be expressed in the substantia nigra, and tags corresponding to all were found in the SN SAGE libraries. Of special note, tags arising from α-synuclein (SNCA) transcripts are found in our SAGE libraries at a level of 83 tags per million. Mutations in SNCA have been shown to cause autosomal dominant PD (6). Similarly, ubiquitin carboxyl-terminal esterase L1 (PARK5) is expressed at 146 tags per million in our libraries, and mutations in PARK5 have also been reported to cause PD (7). The fact that SAGE analysis of the SN can detect the presence of these PD-causing genes supports our paradigm that additional PD susceptibility genes can be detected through this kind of expression analysis. It should be noted that tags corresponding to parkin (PARK2) were not found; however, this gene is expressed at low levels in the midbrain (8), and has previously been found in only a single astrocytoma SAGE library (Gene Expression Omnibus GSM697).
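Abundances expressed in tags per million are a simple rescaling of raw tag counts by library size. The sketch below illustrates the conversion using the pooled total of 47 994 tags; the raw counts of 4 and 7 are back-calculated assumptions that happen to reproduce the reported values, and are not counts stated in the text.

```python
def tags_per_million(count: int, total_tags: int) -> float:
    """Scale a raw tag count to tags per million sequenced tags."""
    return count / total_tags * 1_000_000


TOTAL_TAGS = 47_994  # pooled size of libraries 542 and 543

# Raw counts are assumed for illustration (back-calculated, not reported).
for gene, count in [("SNCA", 4), ("PARK5", 7)]:
    print(gene, round(tags_per_million(count, TOTAL_TAGS)))
```

With these assumed counts the conversion gives roughly 83 and 146 tags per million, which shows how few observations underlie abundance estimates for low-level transcripts.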

### Genes expressed in the SN

Approximately 71% of the nigral tags could be assigned to UniGene clusters, similar to other recently described SAGE libraries (9). The 20 most abundantly expressed SAGE tags in the pooled SN libraries are shown in Table 1. (The full SAGE libraries for samples 542 and 543 have been submitted to the Gene Expression Omnibus and are thus publicly available.) Table 2 summarizes the known or inferred gene functions of the 452 tags appearing more than 10 times in the pooled nigra SAGE libraries (0.017% of total transcripts). We have excluded from this analysis ambiguous tags that map to more than one gene, and 89 tags corresponding to genes of unknown function. Some genes have more than one legitimate tag owing to alternative polyadenylation sites or alternative splicing patterns; for this analysis we have pooled such tags together. The largest functional categories are genes involved in protein synthesis (17%) and degradation (6%), structural proteins (7%), and growth factors or genes involved in cell cycle control (7%).

The two midbrain SAGE libraries contain 4522 orphan tags that do not map to any known UNIGENE clusters (build 151). In fact it is surprising that several of the most abundant tags in our analysis are orphan tags. These tags could arise from at least five possible sources: (1) novel mRNA splice forms of known genes; (2) novel polyadenylation sites of known genes; (3) transcripts of previously unknown genes; (4) incomplete NlaIII digestion or cDNA priming from internal stretches of A; and (5) technical sources such as Taq polymerase or sequencing errors. As discussed above, we expect only about 100 sequencing errors in these libraries, and technical errors would primarily create singleton tags; therefore, it is likely that the 913 non-singleton orphan tags represent actual transcripts. Alternative patterns of RNA splicing and polyadenylation in the substantia nigra may well play some functional role in that tissue. For example, alterations in the splicing pattern of the microtubule associated protein tau (MAPT) transcript have been shown to cause the chromosome 17-linked form of frontal temporal dementia with parkinsonism (FTDP, OMIM 157140) (10). Therefore, analysis of orphan tags in the SN will reveal much about RNA processing in this tissue. Finally, orphan tags may represent entirely novel transcripts. The ability to detect such novel transcripts is one of the great advantages that SAGE offers, relative to other methods of gene expression analysis such as microarrays.

Several methods can be used to elongate orphan SAGE tags (11,12). Analysis of over 300 orphan tags from bone marrow cells has shown that almost 70% of such tags represent novel genes (13). We have used one such technique called Reverse SAGE to elongate tag CACCTAATTG (the sixth most abundant tag in the pooled SN SAGE libraries). Unexpectedly, we obtained 116 bp of sequence identical to the mitochondrial gene ATPase 6 (GI:13009, Swiss-prot:P00846). This observation appears to reveal a shortcoming of current methods for mapping SAGE tags to genes: NCBI's tag-to-gene mapping files are based on the UNIGENE clusters. Because all mitochondrial sequences are removed prior to construction of UNIGENE clusters, SAGE tags associated with mitochondrial genes cannot be identified. We are actively investigating the identity of additional orphan tags, with special emphasis on the 10 remaining orphan tags that are among the 20 most abundantly expressed tags in the SN.

### Statistical significance of tag count differences

In most libraries, the majority of SAGE tags are present at low levels (tag number ≤15) (14), so methods for significance testing in SAGE data must be robust when comparing low tag counts. In this study, we employed both chi-squared and Fisher exact tests. The chi-squared test requires an expected cell count ≥5 for each of the cells being compared, while the Fisher exact test performs better when cell counts are lower. Man et al. (14) conducted a power study to compare the chi-squared test, the Fisher exact test, and a Bayesian approach (15) for SAGE data analysis. They concluded that the chi-squared test has 5–10% higher power than either the Fisher exact test or Audic and Claverie's Bayesian approach for SAGE data. Furthermore, the chi-squared test gives a correct type I error rate even in the low tag count case. However, the Fisher exact test should be used when the sample size is small, because for cell counts ≤5 the chi-squared test frequently concludes that a result is significant when the Fisher exact test does not. Here, we used a more conservative cutoff, performing the Fisher exact test for cases having tag counts ≤10. We believe our strategy, a combination of the chi-squared test and the Fisher exact test for the rare tag count case, is robust for detecting transcription differences between two samples.

### Comparison of SN expression between two individuals

Statistical analysis of our SAGE results shows the patterns of gene expression are extremely similar in these two control SN samples. After normalization of tag abundance in the two libraries, both chi-squared and Fisher exact tests were used to calculate the significance of expression differences. The chi-squared test identified 177 tags as significantly different (P<0.05). Applying the Fisher exact test in cases with low tag counts reduced this number to 138. After correction for 15 516 independent tests (the total number of SAGE tags in the two libraries) using the false discovery rate method (α=0.05 and P0=1) (16), only two tags were differentially expressed. One of these tags (TCCCGTACAT) does not map to any known gene, but has been found in numerous SAGE libraries from a variety of tissues. The other tag that differs significantly between the two libraries (AAAAAAAAAA) is highly redundant, mapping to 858 UNIGENE clusters.

Using the modified false discovery rate correction for 1308 independent tests (α=0.05 and P0=1) identifies a total of four tags whose expression levels differ significantly between these two samples (Table 3). In addition to the two tags previously discussed, these include one additional orphan tag that does not map to any known UNIGENE clusters, but has been found in numerous SAGE libraries from a variety of tissues, and one tag corresponding to UNIGENE cluster 352628. The gene corresponding to this cluster encodes ADAMTS16, a disintegrin-like and metalloprotease (reprolysin type) with a thrombospondin type 1 motif. This gene maps to chromosome 5 and is not located within any of our PD linkage peaks. The two differentially expressed orphan tags are being further characterized by the isolation of additional 3′ transcript sequence, as described above.

### Mapping genes expressed in the SN to regions of linkage

We have recently completed genomic linkage analysis of a set of 174 multiplex families in which PD is segregating (3). This analysis generated five distinct linkage intervals, averaging 33 Mb in size, and queries of the human genomic assembly (UCSC build 31) reveal that 4675 UNIGENE clusters map within these intervals—far too many for detailed follow-up analysis or individual tissue expression analysis. The SAGE expression profiling presented here allows us to prioritize analysis of those genes that are expressed in the SN. Conversely, the usefulness of expression analyses like microarray and SAGE analysis often is greatly hampered by the lack of specificity it provides between comparisons of groups, providing an often overwhelming amount of information. Determining which expression changes are significant and should have valuable resources committed to their investigation can be a very difficult proposition. The use of intersecting data derived from these two powerful resources presents the first step towards a more efficient method of focusing effort on susceptibility genes.

Individual chromosome sequence files (build 31) were downloaded from the University of California Santa Cruz (UCSC). The genomic coordinates for each UNIGENE set were ascertained as described in the Methods section and stored in a MySQL database. Using this data, genomic coordinates were assigned to each tag expressed in the SN SAGE libraries. These coordinates were compared with those of the microsatellite markers defining boundaries of the linkage regions. In this way, we determined that only 402 of the 4675 genes mapping to linkage regions are actually expressed in our SAGE analysis of the SN, allowing a reduction of our follow-up effort by over 91%. These 402 genes may be found in supplementary Table 4, published as supporting information on the Human Molecular Genetics website.
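The comparison of gene coordinates with linkage boundaries amounts to an interval-overlap test. The sketch below illustrates that step under stated assumptions: the `Interval` class, the gene names and all coordinates are hypothetical, and the study itself stored coordinates in a MySQL database rather than in Python objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    chrom: str
    start: int  # inclusive, base pairs
    end: int    # inclusive, base pairs


def overlaps(a: Interval, b: Interval) -> bool:
    """True if both intervals are on the same chromosome and share a base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def genes_in_linkage_regions(genes, regions):
    """Return names of genes whose coordinates overlap any linkage region."""
    return [name for name, loc in genes.items()
            if any(overlaps(loc, region) for region in regions)]


# Hypothetical coordinates, for illustration only.
regions = [Interval("chr5", 1_000_000, 34_000_000)]
genes = {
    "geneA": Interval("chr5", 2_500_000, 2_560_000),    # inside the region
    "geneB": Interval("chr5", 40_000_000, 40_020_000),  # same chromosome, outside
    "geneC": Interval("chr9", 2_500_000, 2_560_000),    # different chromosome
}
print(genes_in_linkage_regions(genes, regions))

# Reduction in follow-up effort: 402 of 4675 genes retained.
print(f"{1 - 402 / 4675:.1%}")
```

The final line reproduces the arithmetic behind the reduction quoted above: retaining 402 of 4675 genes removes about 91.4% of the candidates.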

## DISCUSSION

We have presented a SAGE gene expression profile of the human SN and surrounding midbrain tissues—a region of the brain critically affected in individuals with PD. The resulting detailed picture of midbrain RNA transcripts provides insight into the patterns of metabolic activity in this tissue. We have also presented a novel two-tiered method for measuring statistically significant differences in SAGE expression data: P-values are calculated using chi-squared analysis supplemented by the Fisher exact test in cases of low tag counts. We have proposed correcting these P-values for multiple testing in two different ways: a conservative approach that counts all expressed SAGE tags as independent tests, and a second approach that examines only those tags expressed at greater than 0.01% of total transcripts. Using the second method, we have identified only four transcripts that are expressed at significantly different levels in the midbrain tissue of two individuals. This is in striking contrast to a recent analysis of peripheral retina of two control individuals that showed 56 tags (36 known genes and 20 orphan tags) expressed at significantly different levels (9). The authors of that study used a chi squared analysis with a Bonferroni correction for multiple testing. If we apply the same approach to the analysis of our data, only two tags are expressed at significantly different levels. One possible explanation for the difference in interpersonal variation is that the retinal tissue donors were of different ages, whereas our two midbrain tissue donors were age-matched. It is also possible that gene expression is simply more variable in the retina than in the midbrain. A thorough understanding of the extent of variability of gene expression in normal tissues is critical to evaluate the effectiveness of gene expression profiling in identifying candidate susceptibility genes. 
The small degree of variability we see between the SN of two control individuals suggests that significant expression differences detected in future SAGE comparisons of PD and control SN are likely to reflect the disease process.

The most appropriate method for testing statistical significance of expression differences may vary from one application to another. For example, the medical use of expression analysis of tumor tissue for choosing treatment strategies might require the most conservative definitions of statistical significance. In contrast, we are using expression analysis to help prioritize analysis of positional candidates for PD susceptibility genes. For this application, a less stringent measure of significance is not only appropriate, but may be advantageous, especially if the genes identified are to be subjected to additional testing.

We have combined our SAGE expression profiling of the midbrain with the results of our previously reported PD linkage analysis (3) to create a powerful methodology for identifying and prioritizing candidate susceptibility genes for Parkinson disease—the first step in a process of genomic convergence. We have identified genes that map to PD linkage intervals, and that are also expressed in a tissue that is intimately involved in PD pathogenesis. This analysis results in a significant reduction in laboratory effort, as fewer than 10% of the genes within linkage regions are actually expressed in the midbrain. Neither SAGE nor linkage analysis requires prior knowledge or assumptions about the disease mechanism: in fact, the process described here is most likely to succeed if the data are allowed to drive the convergence, with minimal influence of a priori hypotheses into the mechanism of PD.

In this study, we have analyzed tissue from two normal individuals well within the age at onset of PD. It can certainly be debated whether susceptibility genes involved in the initiation or evolution of disease will be differentially expressed, or expressed at all, at any single timepoint. Further, we have analyzed only control tissue, not tissue from PD patients. However, in a complex disease such as idiopathic PD, it is as likely that increased disease risk is conferred by the presence of an allelic variant of a susceptibility gene, rather than by a dramatic change in expression of that gene. There is no reason to believe that the expression level of this susceptibility gene would necessarily be altered; in fact, such expression changes may reflect the downstream pathophysiology of PD rather than the initial events leading to disease. By analogy to Alzheimer disease, an individual with an APOE 4/4 genotype is at greater risk of developing disease than an individual with an APOE 2/2 genotype. However, there is no evidence that overall APOE levels differ between these two individuals. An analysis of normal control individuals would have identified APOE as a gene expressed in hippocampus that lies within the Alzheimer disease chromosome 19 linkage peak. Therefore, the fact that genes are expressed in the normal SN, coupled with the fact that they lie within linkage peaks, makes them excellent candidates for PD. Certainly, there are compelling reasons to examine gene expression in the SN of PD patients, and that work is ongoing in our laboratory. We will continue to incorporate the concept of genomic convergence to reduce the number of potential candidates. Potential genomic convergence approaches besides expression changes in PD patients include haploblock analysis, biological relevance and uniform association mapping. This approach has identified several classes of candidate PD susceptibility genes.

Three 70 kDa heat shock proteins (4, 5 and 9B) map to regions of linkage and are expressed in the midbrain. Heat shock proteins are a highly conserved set of polypeptides that appear to play a crucial role in thermotolerance and protection from cellular damage resulting from a variety of physiological stresses (17). Chaperones of the heat shock protein 70 and 40 families are sequestered into polyglutamine-containing inclusions found in several neurodegenerative diseases. It has been shown that the constitutive expression of chaperones can mitigate polyglutamine aggregation toxicity (18) and rescue motor neurons from axotomy-induced cell death (19). In a Drosophila model for PD, overexpressing human α-synuclein in transgenic flies in a pan-neural pattern resulted in formation of α-synuclein-rich filamentous protein aggregates that resembled Lewy bodies. This phenotype was accompanied by a significant age-dependent loss of dopaminergic neurons manifested by dysfunctional locomotive behavior, particularly age of onset retardation in negative geotactic response (20). This phenotype was substantially reversed by co-expression of either human or fly chaperone Hsp70 (21). The chaperones expressed in the SN may play just such a neuroprotective role.

Ubiquitination pathways may well be involved in the pathogenesis of PD: parkin is an E3 ligase, and Lewy bodies contain many ubiquitinated proteins. Four genes encoding proteins in the ubiquitination pathway are localized to regions of linkage as well as being expressed in the midbrain: ubiquitin-specific protease 20 (Hs.5452), ubiquitin B (Hs.183842), and ubiquitin conjugating enzymes E2B and E2D2 (Hs.811 and Hs.108332). Finally, three subunits of the proteasome complex are found in our SAGE libraries and map to linkage intervals: macropain subunits beta type 3 (Hs.82793), beta type 7 (Hs.118065), and the non-ATPase subunit 11 (Hs.90744). This combination of expression analysis and linkage information is a powerful approach for the identification of candidate susceptibility alleles for PD, and will be strengthened even more when SAGE data are available from tissue of patients affected with PD.

We have combined high-throughput expression analysis (SAGE) with genomic linkage analysis to prioritize over 400 genes for additional analysis as PD susceptibility genes. This constitutes the first step in genomic convergence: a multifactorial, interdisciplinary approach to the identification of susceptibility genes for complex diseases like PD. We plan to continue this approach with additional convergence factors, comparing expression differences between the SN from PD and control individuals, between different PD patients, between PD patients and those with other Parkinsonisms, as well as gene–gene interaction studies, to name just a few. It is only through the synergism of such a variety of independent lines of evidence that we will obtain a detailed understanding of the etiology of complex diseases such as PD.

## MATERIALS AND METHODS

### Procurement of tissue

Human midbrain tissues used in this study were collected as normal controls by the Kathleen Price Bryan Brain Bank, in the Alzheimer's Disease Research Center, Duke University Medical Center (22,23). Patient 542 was an 81-year-old male with an intention tremor but no other evidence of any progressive neurological disease. In the final week of life, the patient became septic following major abdominal surgery and subsequently died from cardiac arrest. Brain tissue was collected with a postmortem delay of 3:15 h. Patient 543 was a 72-year-old female with no evidence of a movement disorder or any other neurodegenerative disease. Cause of death was bronchial pneumonia secondary to metastatic breast cancer. Brain tissue was collected with a postmortem delay of 3:00 h. At the time of autopsy, brain hemispheres were frozen in liquid nitrogen and subsequently stored at −80°C. Prior to isolation of total RNA for SAGE analysis, the substantia nigra and adjacent midbrain tissues were removed from each frozen hemisphere. The entire dissection procedure was performed on dry ice. Total RNA was extracted using the RNagents kit (Promega, Madison WI, USA). Yields of total RNA for samples 542 and 543 were 490 µg (1.30 g tissue) and 305 µg (1.17 g tissue), respectively.

### SAGE library construction and analysis

We used a modification of the original SAGE protocol (24) to generate two human SAGE libraries. DNA was isolated from individual library clones using 96-well format Qiagen REAL minipreps, and sequenced with an ABI 3700 capillary sequencer using BigDye chemistry. SAGE tags were extracted from the .PHD files with eSAGE software (25), using a threshold value of PHRED 20 for each base in a SAGE tag. The identity of the transcript corresponding to each SAGE tag was determined using the reliable tag-to-gene mapping files derived from UNIGENE build 151 (National Center for Biotechnology Information, NCBI).
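The per-base quality threshold described above can be expressed as a simple filter. This is a minimal sketch of the rule only, not the eSAGE implementation; the function name and the example quality values are hypothetical.

```python
def tag_passes_quality(base_qualities, min_phred=20):
    """Accept a tag only if every base call meets the PHRED threshold."""
    qualities = list(base_qualities)
    # An empty list carries no evidence of quality, so it is rejected.
    return len(qualities) > 0 and all(q >= min_phred for q in qualities)


# Hypothetical PHRED scores for the bases of two tags.
good_tag = [38, 40, 35, 37, 41, 39, 36, 38, 40, 37]
bad_tag = [38, 40, 35, 19, 41, 39, 36, 38, 40, 37]  # one base below 20

print(tag_passes_quality(good_tag))
print(tag_passes_quality(bad_tag))
```

A single low-quality base is enough to discard a tag under this rule, which is what makes the threshold conservative.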

### Reverse SAGE

Reverse-SAGE is a PCR-based technique that allows isolation of additional sequence between a SAGE tag and the 3′ end of a transcript. A detailed protocol is available online at www.sagenet.org/protocol/protocolsoftware.htm. We used the tag-specific primer 5′-TAC GGG GAC ATG CAC CTA ATT G-3′ in conjunction with this technique to isolate additional sequence associated with tag CACCTAATTG.

### Statistical analysis

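We employed the chi-squared test and Fisher exact test to test the difference in tag counts between two samples (14). That is, for each particular Tag A, we arranged the data in a 2×2 table as below:

| | Sample 1 | Sample 2 | Total |
|---|---|---|---|
| Tag A | n11 | n12 | N1. |
| All other tags | n21 | n22 | N2. |
| Total | N.1 | N.2 | N.. |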

The tag counts for Tag A in samples 1 and 2 are designated n11 and n12, respectively; n21 and n22 are the sums of the remaining tags in library 1 and library 2, respectively. N1. and N2. denote the row totals, N.1 and N.2 the column totals, and N.. the grand total. The chi-squared statistic for the 2×2 table is computed by the following formula and compared with the chi-squared distribution with 1 degree of freedom at the 0.05 type I error rate:

$\chi^{2} = \frac{N_{\cdot\cdot}\,(n_{11}n_{22} - n_{12}n_{21})^{2}}{N_{1\cdot}N_{2\cdot}N_{\cdot 1}N_{\cdot 2}}$
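The statistic above can be computed directly from the four cell counts. The following is a minimal sketch using only the Python standard library, not the analysis code used in this study; the P-value uses the identity that, for 1 degree of freedom, the upper-tail probability equals erfc(sqrt(χ²/2)). The example counts are hypothetical.

```python
import math


def chi_squared_statistic(n11, n12, n21, n22):
    """Chi-squared statistic for a 2x2 table (no continuity correction)."""
    row1, row2 = n11 + n12, n21 + n22
    col1, col2 = n11 + n21, n12 + n22
    total = row1 + row2
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be positive")
    return total * (n11 * n22 - n12 * n21) ** 2 / (row1 * row2 * col1 * col2)


def chi_squared_p_value_1df(statistic):
    """Upper-tail probability of a chi-squared variate with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


# Hypothetical tag seen 30 times in library 542 and 10 times in library 543.
n11, n12 = 30, 10
n21, n22 = 23766 - n11, 24228 - n12
stat = chi_squared_statistic(n11, n12, n21, n22)
print(stat, chi_squared_p_value_1df(stat))
```

A statistic above about 3.84 corresponds to P<0.05 before any correction for multiple testing.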

In general, chi-squared analysis requires an expected cell count ≥5 (26). To be conservative, when any of the expected count values in the 2×2 table (n11, n12, n21, n22) was ≤10, we performed the Fisher exact test in addition to the chi-squared test. In the Fisher exact test, we fixed the marginal sum (i.e. N1., N2., N.1, N.2) and created all possible tables with the same marginal sum. The probability of observing each possible table configuration was computed as below:

$P = \frac{N_{1\cdot}!\,N_{2\cdot}!\,N_{\cdot 1}!\,N_{\cdot 2}!}{N_{\cdot\cdot}!\,n_{11}!\,n_{12}!\,n_{21}!\,n_{22}!}$
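Summing this table probability over every table that is no more probable than the observed one gives a two-sided P-value. The sketch below is one way to carry out that calculation with the standard library, and is not the code used in this study; it works with log-factorials (`math.lgamma`) so that library-sized totals do not overflow, and the small tolerance used when comparing probabilities is an implementation choice.

```python
import math


def log_table_probability(n11, n12, n21, n22):
    """Log probability of a 2x2 table with all marginal sums held fixed."""
    def log_factorial(k):
        return math.lgamma(k + 1)

    row1, row2 = n11 + n12, n21 + n22
    col1, col2 = n11 + n21, n12 + n22
    total = row1 + row2
    return (log_factorial(row1) + log_factorial(row2)
            + log_factorial(col1) + log_factorial(col2)
            - log_factorial(total)
            - log_factorial(n11) - log_factorial(n12)
            - log_factorial(n21) - log_factorial(n22))


def fisher_exact_two_sided(n11, n12, n21, n22):
    """Two-sided Fisher exact P-value for a 2x2 table."""
    row1, row2 = n11 + n12, n21 + n22
    col1 = n11 + n21
    log_observed = log_table_probability(n11, n12, n21, n22)
    p_value = 0.0
    # Enumerate every value of the top-left cell compatible with the margins.
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        log_p = log_table_probability(x, row1 - x, col1 - x, row2 - col1 + x)
        if log_p <= log_observed + 1e-9:  # tolerance absorbs rounding in ties
            p_value += math.exp(log_p)
    return min(1.0, p_value)


print(fisher_exact_two_sided(3, 1, 1, 3))
```

Because the first row total is small whenever a tag is rare, only a handful of tables need to be enumerated even for libraries of more than 20 000 tags.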

The two-sided P-value of the Fisher exact test is the sum of all probabilities that are less than or equal to the probability derived from the original observed table. We used these methods to compare SAGE tag abundance in two control substantia nigra libraries. Since P-values were calculated for several thousands of genes simultaneously, we applied Benjamini and Hochberg's (16) false discovery rate (FDR) procedure to correct for multiple testing. Suppose that we simultaneously test n null hypotheses H1, H2,…, Hn on the basis of independent chi-squared or Fisher exact tests. The P-values were ordered as P(1)≤P(2)≤…≤P(n). If k is the largest integer such that

$P_{(k)} \leq \frac{k}{n}\,\frac{\alpha}{p_{0}}$

then we reject all H(i) for i≤k, where p0 is the proportion of true Hi. For this analysis, we assume α=0.05 and p0=1 (i.e. all Hi are true). We have used two different values for the number of independent tests. The more conservative approach is to consider significance testing of each of the 15 516 different SAGE tags an independent test. Applying the FDR correction to each SAGE tag would require a P-value of 0.0000032 to achieve statistical significance for the most significant tag. As our goal is data exploration and convergence of evidence across multiple independent analyses, insisting on adherence to this level of correction defeats the purpose of the analysis. To derive a less conservative correction of the P-values, we propose including in this analysis only those tags that are expressed at greater than 0.01% of total transcripts (sum of tag counts in both libraries >5). This modified FDR correction is intermediate between no correction (the standard level of statistical significance, P<0.05) and a correction based on all expressed tags (P<0.0000032). In this case, there are 1308 tags (of a total of 15 516 tags) expressed above 0.01% of total transcripts, resulting in a modified statistical significance level of P<0.000038. The Fisher exact test is still used prior to this modified FDR correction for any comparisons involving a single cell with a tag count below 10.
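The step-up procedure and the two significance levels quoted above can be illustrated as follows. This is a minimal sketch of the Benjamini and Hochberg rule, not the code used in this study; the function names and the example P-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05, p0=1.0):
    """Return indices of hypotheses rejected by the step-up FDR procedure.

    Finds the largest rank k with P(k) <= (k / n) * (alpha / p0) and
    rejects the hypotheses with the k smallest P-values.
    """
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    k = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / n * alpha / p0:
            k = rank
    return sorted(order[:k])


def first_rank_threshold(n_tests, alpha=0.05, p0=1.0):
    """P-value needed by the most significant test (rank 1 of n_tests)."""
    return alpha / (n_tests * p0)


# Hypothetical P-values for ten tags.
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(example))

print(first_rank_threshold(15516))  # all expressed tags
print(first_rank_threshold(1308))   # tags above 0.01% of total transcripts
```

The two thresholds printed at the end are approximately 0.0000032 and 0.000038, the values quoted for the conservative and modified corrections.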

### Mapping SAGE tags to genomic location

The 24 individual chromosome sequence files (build 29) were downloaded from the University of California Santa Cruz (UCSC). A representative sequence for each UNIGENE cluster (build 150) was obtained from ‘Hs.seq.unique’ (NCBI), and aligned with the chromosomal sequences using BLAT (available from UCSC). The results were filtered: match/Qsize ≥75%; score ≥0.7 [score=(match-mismatch-gap_number×5-gap_size×2)/Qsize]. Only the highest score was kept unless there were multiple identical scores ≥0.7. The corresponding genomic coordinates for each UNIGENE set were stored in a local MySQL database. In this way, 80 104 of the 104 024 UNIGENE clusters in build 150 were mapped to 81 414 locations, most with a 100% match. Clusters mapping to more than one location were considered to map to regions of linkage if at least one copy did so. After SAGE analysis, the UNIGENE set corresponding to each tag was used to query this database, and the genomic map coordinates were recorded. These coordinates were then compared with the coordinates for linkage peaks, whose boundaries were defined by the positions of the microsatellite markers at which the lod score fell one unit below that of the peak marker. Full documentation and associated perl scripts are available upon request.
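The alignment filter described above can be sketched as follows. This illustrates the stated rules only and is not the perl scripts used in this study; the `Alignment` class, its field names and the example values are hypothetical stand-ins for the corresponding BLAT output columns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    chrom: str
    start: int
    match: int
    mismatch: int
    gap_number: int
    gap_size: int
    q_size: int


def alignment_score(a: Alignment) -> float:
    """score = (match - mismatch - gap_number*5 - gap_size*2) / Qsize."""
    return (a.match - a.mismatch - a.gap_number * 5 - a.gap_size * 2) / a.q_size


def passes_filter(a: Alignment) -> bool:
    """Require match/Qsize >= 75% and score >= 0.7."""
    return a.match / a.q_size >= 0.75 and alignment_score(a) >= 0.7


def best_hits(alignments):
    """Keep the highest-scoring passing alignment, plus any exact ties."""
    kept = [a for a in alignments if passes_filter(a)]
    if not kept:
        return []
    top = max(alignment_score(a) for a in kept)
    return [a for a in kept if alignment_score(a) == top]


# Hypothetical alignments of one 500-base cluster sequence.
hits = [
    Alignment("chr4", 1_000, 500, 0, 0, 0, 500),   # perfect match
    Alignment("chr7", 9_000, 480, 10, 2, 5, 500),  # passes, lower score
    Alignment("chr2", 5_000, 300, 0, 0, 0, 500),   # fails match/Qsize
    Alignment("chr12", 7_000, 500, 0, 0, 0, 500),  # ties the best score
]
print([(a.chrom, alignment_score(a)) for a in best_hits(hits)])
```

Returning all tied top-scoring alignments mirrors the rule that a cluster may map to more than one location, which is why the number of mapped locations exceeds the number of mapped clusters.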

We have used UNIGENE build 151 for tag-to-gene mapping of our SAGE data, and the very closely related build 150 to perform genomic mapping. While it would be ideal to use the same UNIGENE build for both purposes, this is not currently possible, as NCBI does not generate SAGE tag-to-gene mapping flatfiles for all UNIGENE builds, nor do they archive retired UNIGENE builds.

## SUPPLEMENTARY MATERIAL

Supplementary Material is available at HMG Online.

## ACKNOWLEDGEMENTS

We would like to thank Elliott Margulies for valuable assistance with eSAGE software and Bob Lyons for providing perl scripts for sequence analysis. Finally, we would like to thank the patients and their families, without whose generosity and support this research would not be possible. This work was supported by National Institutes of Health grant NS39764. J.E.S. is supported in part by National Institutes of Health grant ES00372. The Kathleen Price Bryan Brain Bank within the Duke Alzheimer Disease Research Center is supported by National Institutes of Health grant PH50 AG05128. Glaxo Wellcome Corporation provided salary support for S.T. and C.H.

*

To whom correspondence should be addressed at: Center for Human Genetics, Duke University Medical Center, DUMC, Box 2903, Durham, NC 27710-2903, USA. Tel: +1 9196843508; Fax: +1 9196840919; Email: mike.hauser@duke.edu

Table 1.

The 20 most highly expressed SAGE tags in the pooled SN libraries

| Tag | Abundance<sup>a</sup> | UNIGENE | Description |
|---|---|---|---|
| CAAGCATCCC | 31 330 | | Orphan tag |
| CCCATCGTCC | 25 580 | | Orphan tag |
| CACCTAATTG | 20 990 | | Orphan tag |
| AAAACATTCT | 20 870 | 323562 | Similar to implantation-associated protein |
| AGCCCTACAA | 19 350 | 95243 | Transcription elongation factor A (SII)-like 1 |
| CTAAGACTTC | 17 370 | | Orphan tag |
| ACTTTTTCAA | 17 060 | | Orphan tag |
| ATTTGAGAAG | 16 950 | | Orphan tag |
| TTCATACACC | 16 930 | | Orphan tag |
| ACACAGCAAG | 16 910 | 27115 | Similar to SFRB_splicing factor, arginine/serine-rich 11 |
| TCCCCTACAT | 16 580 | 352628 | Disintegrin-like metalloprotease (reprolysin type) |
| TCCCGTACAT | 16 040 | | Orphan tag |
| ACCCTTGGCC | 16 010 | | Orphan tag |
| CAACTAATTC<sup>b</sup> | 15 510 | 75106 | Apolipoprotein J |
| | | 69997 | Zinc finger protein 238 |
| TTGGGGTTTC<sup>b</sup> | 14 470 | 62954 | Ferritin, heavy polypeptide 1 |
| | | 75850 | WAS protein family, member 1 |
| AGGTGGCAAG | 14 430 | | Orphan tag |
| TAGGTTGTCT<sup>b</sup> | 13 620 | 355549 | Similar to IgE-dependent histamine-releasing factor |
| | | 279860 | Tumor protein, translationally-controlled 1 |
| | | 356466 | ESTs |
| ACTAACACCC | 13 620 | | Orphan tag |
| GTTGTGGTTA<sup>b</sup> | 13 350 | 75415 | Beta-2-microglobulin |
| | | 99785 | Homo sapiens cDNA: FLJ21245 fis, clone COL01184 |
| ATGTGAAGAG<sup>b</sup> | 13 330 | 111779 | Secreted protein, acidic, cysteine-rich (osteonectin) |
| | | 126515 | ESTs |

<sup>a</sup>Abundance is given as tags per million. Actual tags in library: 47 994.

<sup>b</sup>Tag maps to more than one UNIGENE cluster.
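
The tags-per-million normalization in footnote a can be made concrete with a short Python sketch. The function names are hypothetical; the only value taken from the table is the library size of 47 994 tags. At this library size, a single observed tag corresponds to roughly 20.8 tags per million.

```python
LIBRARY_SIZE = 47_994  # actual tags in the pooled library (Table 1, footnote a)

def tags_per_million(count, library_size=LIBRARY_SIZE):
    """Normalize a raw tag count to tags per million."""
    return count / library_size * 1_000_000

def approximate_count(tpm, library_size=LIBRARY_SIZE):
    """Convert a tags-per-million value back to an approximate raw count."""
    return tpm * library_size / 1_000_000
```

Because the abundances in the table are rounded, converting them back with `approximate_count` gives only an approximate raw count.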

Table 2.

Summary of gene functions for 452 most abundant midbrain transcripts

| Class | Percentage |
|---|---|
| Apoptosis | 1.4 |
| Cell cycle/growth factors | 6.6 |
| Chaperones | 3.3 |
| Energy metabolism | 8.8 |
| Immune | 4.1 |
| Kinase phosphatase | 3.6 |
| Molecular motors | 2.5 |
| Other known | 24.2 |
| Protein synthesis | 17.1 |
| Protein translation | 3.3 |
| Receptors | 1.4 |
| Redox | 0.6 |
| Signal transduction | 4.7 |
| Structural | 7.4 |
| Transcription | 5.0 |

Table 3.

Tags expressed at significantly different levels in substantia nigra from two control individuals

| Tag | Count in 542 | Count in 543 | P-value | Threshold P-value<sup>a</sup> | Description |
|---|---|---|---|---|---|
| TCCCGTACAT | 204 | 81 | 1.26×10<sup>−9</sup> | 0.000140 | Orphan |
| AAAAAAAAAA | 89 | 24 | 8.80×10<sup>−8</sup> | 0.000279 | Redundant |
| ATAATACATA | 10 | 35 | 3.03×10<sup>−5</sup> | 0.000419 | Orphan |
| TCCCCTACAT | 208 | 111 | 3.54×10<sup>−5</sup> | 0.000559 | UNIGENE cluster 352628, metalloprotease |

<sup>a</sup>Significance threshold corrected with the FDR procedure for 1308 statistical tests.
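
The rank-dependent thresholds in Table 3 rise in equal steps, which is the general shape of a step-up false discovery rate procedure. The Python sketch below assumes that the FDR procedure is the Benjamini and Hochberg step-up method (reference 16). The function names and the FDR level `q` are illustrative assumptions: this section does not state the level used, and the sketch does not attempt to reproduce the threshold values in the table.

```python
def bh_thresholds(num_tests, q):
    """Benjamini-Hochberg threshold for each rank i (1-based): i * q / m."""
    return [rank * q / num_tests for rank in range(1, num_tests + 1)]

def bh_significant(p_values, q):
    """Return the indices of tests declared significant at FDR level q.

    Step-up rule: find the largest rank k whose sorted P-value is at or
    below k * q / m, then declare the k smallest P-values significant.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff_rank = rank  # keep the largest rank that passes
    return sorted(order[:cutoff_rank])
```

Under this rule a P-value can be declared significant even when it exceeds its own rank threshold, provided that a larger P-value further down the sorted list passes its threshold.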

## References

1. Tanner, C.M. and Goldman, S.M. (1996) Epidemiology of Parkinson's disease. Neurol. Clin., 14, 317–335.
2. Velculescu, V.E., Zhang, L., Vogelstein, B. and Kinzler, K.W. (1995) Serial analysis of gene expression. Science, 270, 484–487.
3. Scott, W.K., Nance, M.A., Watts, R.L., Hubble, J.P., Koller, W.C., Lyons, K., Pahwa, R., Stern, M.B., Colcher, A. and Hiner, B.C. et al. (2001) Complete genomic screen in Parkinson disease: evidence for multiple genes. JAMA, 286, 2239–2244.
4. Margulies, E.H., Kardia, S.L. and Innis, J.W. (2001) Identification and prevention of a GC content bias in SAGE libraries. Nucl. Acids Res., 29 (12), E60-0.
5. Margulies, E.H., Kardia, S.L. and Innis, J.W. (2001) A comparative molecular analysis of developing mouse forelimbs and hindlimbs using serial analysis of gene expression (SAGE). Genome Res., 11, 1686–1698.
6. Polymeropoulos, M.H., Higgins, J.J., Golbe, L.I., Johnson, W.G., Ide, S.E., Di Iorio, G., Sanges, G., Stenroose, E.S., Pho, L.T. and Schaffer, A.A. et al. (1996) Mapping of a gene for Parkinson's disease to chromosome 4q21–q23. Science, 274, 1197–1199.
7. Leroy, E., Anastasopoulos, D., Konitsiotis, S., Lavedan, C. and Polymeropoulos, M.H. (1998) Deletions in the Parkin gene and genetic heterogeneity in a Greek family with early onset Parkinson's disease. Hum. Genet., 103, 424–427.
8. Solano, S.M., Miller, D.W., Augood, S.J., Young, A.B. and Penney, J.B. Jr (2000) Expression of alpha-synuclein, parkin, and ubiquitin carboxy-terminal hydrolase L1 mRNA in human brain: genes associated with familial Parkinson's disease. Ann. Neurol., 47, 201–210.
9. Sharon, D., Blackshaw, S., Cepko, C.L. and Dryja, T.P. (2002) Profile of the genes expressed in the human peripheral retina, macula, and retinal pigment epithelium determined through serial analysis of gene expression. Proc. Natl Acad. Sci. USA, 99, 315–320.
10. Varani, L., Hasegawa, M., Spillantini, M.G., Smith, M.J., Murrell, J.R., Ghetti, B., Klug, A., Goedert, M. and Varani, G. (1999) Structure of tau exon 10 splicing regulatory element RNA and destabilization by mutations of frontotemporal dementia and parkinsonism linked to chromosome 17. Proc. Natl Acad. Sci. USA, 96, 8229–8234.
11. van den Berg, A., van der Leij, J. and Poppema, S. (1999) Serial analysis of gene expression: rapid RT–PCR analysis of unknown SAGE tags. Nucl. Acids Res., 27, e17.
12. Chen, J.J., Rowley, J.D. and Wang, S.M. (2000) Generation of longer cDNA fragments from serial analysis of gene expression tags for gene identification. Proc. Natl Acad. Sci. USA, 97, 349–353.
13. Chen, J., Sun, M., Lee, S., Zhou, G., Rowley, J.D. and Wang, S.M. (2002) Identifying novel transcripts and novel genes in the human genome by using novel SAGE tags. Proc. Natl Acad. Sci. USA, 99, 12257–12262.
14. Man, M.Z., Wang, X. and Wang, Y. (2000) POWER_SAGE: comparing statistical tests for SAGE experiments. Bioinformatics, 16, 953–959.
15. Audic, S. and Claverie, J.-M. (1997) The significance of digital gene expression profiles. Genome Res., 7, 986–995.
16. Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc., 57, 289–300.
17. Kregel, K.C. (2002) Heat shock proteins: modifying factors in physiological stress responses and acquired thermotolerance. J. Appl. Physiol., 92, 2177–2186.
18. Bailey, C.K., Andriola, I.F., Kampinga, H.H. and Merry, D.E. (2002) Molecular chaperones enhance the degradation of expanded polyglutamine repeat androgen receptor in a cellular model of spinal and bulbar muscular atrophy. Hum. Mol. Genet., 11, 515–523.
19. Kalmar, B., Burnstock, G., Vrbova, G., Urbanics, R., Csermely, P. and Greensmith, L. (2002) Upregulation of heat shock proteins rescues motoneurones from axotomy-induced cell death in neonatal rats. Exp. Neurol., 176, 87–97.
20. Feany, M.B. and Bender, W.W. (2000) A Drosophila model of Parkinson's disease. Nature, 404, 394–398.
21. Auluck, P.K., Chan, H.Y., Trojanowski, J.Q., Lee, V.M. and Bonini, N.M. (2002) Chaperone suppression of alpha-synuclein toxicity in a Drosophila model for Parkinson's disease. Science, 295, 865–868.
22. Hulette, C.M., Welsh-Bohmer, K.A., Crain, B., Szymanski, M.H., Sinclaire, N.O. and Roses, A.D. (1997) Rapid brain autopsy: the Joseph and Kathleen Bryan Alzheimer's Disease Research Center experience. Arch. Pathol. Lab. Med., 121, 615–618.
23. Cummings, T.J., Strum, J.C., Yoon, L.W., Szymanski, M.H. and Hulette, C.M. (2001) Recovery and expression of messenger RNA from postmortem human brain tissue. Modern Pathology, 14, 1157–1161.
24. Virlon, B., Cheval, L., Buhler, J.M., Billon, E., Doucet, A. and Elalouf, J.M. (1999) Serial microanalysis of renal transcriptomes. Proc. Natl Acad. Sci. USA, 96, 15286–15291.
25. Margulies, E. and Innis, J. (2000) eSAGE: managing and analysing data generated with serial analysis of gene expression (SAGE). Bioinformatics, 16, 650–651.
26. Weir, B.S. (1996) Genetic Data Analysis II: Methods for Discrete Population Genetic Data. Sinauer Associates, Sunderland, MA.