Sampling the Arabidopsis Transcriptome with Massively-Parallel Pyrosequencing

Massively-parallel sequencing of DNA by pyrosequencing technology offers much higher throughput and lower cost than conventional Sanger sequencing. Although extensively used already for sequencing of genomes, relatively few applications of massively-parallel pyrosequencing to transcriptome analysis have been reported. To test the ability of this technology to provide unbiased representation of transcripts we analyzed mRNA from Arabidopsis seedlings. Two sequencing runs yielded 541,852 ESTs after quality control. Mapping of the ESTs to the Arabidopsis genome and to TAIR7 cDNA models indicated: 1) massively-parallel pyrosequencing detected transcription of 17,449 gene loci providing very deep coverage of the transcriptome. Performing a second sequencing run only increased the number of genes identified by 10% but increased the overall sequence coverage by 50%. 2) Mapping of the ESTs to their predicted full length transcripts indicated that all regions of the transcript were well represented regardless of transcript length or expression level. Furthermore, short, medium and long transcripts were equally represented. 3) 16,698 of the ESTs that mapped to the genome are not represented in the existing dbEST database. In some cases the ESTs provide the first experimental evidence for transcripts derived from predicted genes and for at least 60 locations in the genome pyrosequencing identified likely protein-coding sequences that are not now annotated as genes. Together the results indicate massively-parallel pyrosequencing provides novel information helpful to improve the annotation of the Arabidopsis genome. Furthermore, the unbiased representation of transcripts will be particularly useful for gene discovery and gene expression analysis of non-model plants with less complete genomic information.


INTRODUCTION
For approximately 30 years, sequencing of DNA by the dideoxy terminator strategy introduced by Sanger (Sanger et al., 1977) has provided the basis for almost all available information about nucleotide sequences.
Pyrosequencing is an alternative technology that detects the pyrophosphate (PPi) released during DNA polymerase catalyzed incorporation of nucleotides.
The PPi liberated with each nucleotide addition can generate light in a reaction coupled to ATP sulphurylase and luciferase. Although proposed as early as 1985 (reviewed by Ahmadian et al., 2006), only recently have instruments become available that solve several technical details and allow large scale use of this approach (Margulies et al., 2005). The GS20 instrument used in this study performs the sequencing reactions in a "massively-parallel" fashion which is referred to in the following text as "pyrosequencing". Double stranded DNA is fragmented, individual molecules are attached to nano-beads, amplified, and each bead is deposited in wells of a highdensity plate with picoliter reaction volumes. As sequencing reagents pass over the plate, light emitted from each well is recorded. Because typically over 300,000 wells simultaneously provide data for a collection of DNA fragments it is possible to obtain 20-30 mega bases or more of sequence information in a single 4.5 hour run. Currently read lengths from each DNA fragment are short (average 100-110 bp) compared to Sanger sequencing, however, this figure is expected to increase substantially with instrument, reagent and protocol improvements.
Most applications of pyrosequencing have involved analysis of genomic DNA (e.g., Poinar et al., 2006). The goal of this study was to evaluate the ability of pyrosequencing to provide information on transcript populations from plant tissues. Sequencing of cDNA copies of transcripts has provided one of the most cost effective approaches for gene discovery because most sequences obtained are protein coding. The absence of introns and intergenic regions greatly enhances the information content and eases the interpretation of the data. Sequencing of a few thousand randomly selected cDNA clones has often been the initial step that led to the identification of key enzymes specific for biosynthesis of a wide range of natural products (Ohlrogge and Benning, 2000;Weber et al., 2004). Hundreds of expressed sequence tag (EST) projects that span the phylogenetic diversity of biology have also provided rich datasets for comparative genomics (Barbier et al., 2005) including variation in protein sequences that allow identification of conserved motifs, active sites and enzyme specificity-determining residues (Mayer et al., 2005).
For this study, we chose to evaluate Arabidopsis because its genome sequence is complete, more than 700,000 conventional ESTs are available, and the genome annotation is the most advanced for any higher plant. In addition, we chose to analyze 8-day old seedlings for which the transcript population has been well characterized by microarrays (Schmid et al., 2005).
Together, these factors allow advanced comparisons of pyrosequencing data to genomic and transcript data. Because of the substantially different methods used to prepare DNA for Sanger and pyrosequencing there are a number of relative advantages and disadvantages of each approach. Pyrosequencing does not require cloning of the DNA and therefore avoids certain biases that can be introduced by enzyme steps or by instability of sequences in E. coli. However, it is not clear if other biases might be associated with the fragmentation, amplification or other steps associated with massively parallel pyrosequencing. In this study we addressed several types of potential bias and provide an analysis of the advantages and disadvantages of current massively-parallel pyrosequencing data compared to conventional EST sequencing and other methods of transcript profiling.

RESULTS
To isolate transcripts, RNA was extracted from aerial tissues of 8-day old light-grown Arabidopsis seedlings and mRNA was prepared by two rounds of oligo-dT purification. First strand cDNA was synthesized with oligo-dT primer and second strand following protocols of a commercial cDNA library preparation kit. After end-repair adaptors were ligated, approximately 3 µ g of the cDNA population was sheared by nebulization, and DNA sequencing was performed with the GS20 genome sequencing system (Margulies et al., 2005).

Access to data
Access to all EST data obtained in this study and tools for mining the data are facilitated through an Excel workbook that is available as supplemental data (Table S1)  pyrosequencing ESTs had at least one significant alignment to a TAIR7 gene model. These ESTs detected transcription of 21,877 cDNA models from 17,449 gene loci, which is 59% of the TAIR7 cDNA models. Over 10,000 of the 17,449 gene loci were represented by at least 3 ESTs and 2,867 were represented by more than 25 ESTs ( Figure S1). Performing a second sequencing run only increased the number of genes identified by 10% but did increase the overall sequence coverage by ~50% (from 7 MB to 10.3 MB). Microarray data indicate 55-67% of Arabidopsis genes are expressed in any single organ (Schmid et al., 2005). This fact together with the observation that a second sequencing run increased the number of genes identified by only about 10% suggests that two pyrosequencing runs detect at least 90% of all genes expressed in this sample. Since pyrosequencing recovered the expected range of ESTs and the second run approached saturation, one or two pyrosequencing runs provide very deep and nearly complete representation of transcripts expressed by Arabidopsis seedlings, including detection of genes with very low expression.

Pyrosequencing ESTs represent the full length of transcripts
Preparation of DNA for pyrosequencing involves random shearing of the DNA by nebulization to provide short fragments suitable for sequencing. The randomness of this shearing process for cDNA has not been adequately assessed. If some cDNAs were resistant to shear forces due to their size, less complete coverage of the sequence might occur. We therefore asked whether there was bias in the regions of the transcript that were represented by pyrosequencing ESTs or in the length of the transcripts that were represented. cDNAs were analyzed based on their expression level and on their length. Mapping of the pyrosequencing ESTs to their corresponding full length transcripts (TAIR7 cDNA models) indicated that all regions of the transcripts were represented by the ESTs. There appears to be a slight strand bias with 55 -60% of reads coming from the plus (same as mRNA) strand. We compared EST distributions representing short (<1000 nt), medium (1000-2000nt), and long (>2000 nt) transcripts. An example that compiles 154,379 ESTs corresponding to 1,053 transcripts of 1000-2000 nt in length is shown in Figure 1. We also examined the distribution of ESTs along the length of transcripts that were highly expressed (615-1,949 ESTs per cDNA), moderately expressed (100-113 ESTs per cDNA) and minimally expressed (10 ESTs per cDNA) ( Figure S2). Although ESTs mapping to the 5' end were in most cases more abundant than other regions, no other substantial bias of ESTs across different regions of the transcripts was observed. For short transcripts there was a slight bias toward higher representation of the middle of the transcript ( Figure S2). This suggests breakage of shorter cDNA sequences near the middle is favored.
Nevertheless, the bias toward the middle is not large and we conclude that other methods of cDNA preparation such as random priming would not substantially improve full coverage of transcripts. The observation of ESTs initiating from every percentile of the cDNAs, regardless of cDNA length or expression level indicates that pyrosequencing is capable of reconstructing complete cDNA sequences. To compare the efficiency of gene discovery by pyrosequencing to traditional EST approaches, we randomly selected 5 sets of 10,000 ESTs from the 734,725 ESTs in GenBank and examined how many unique loci were identified and how much genome sequence was covered by these ESTs. This number was chosen because the cost for sequencing of 10,000 ESTs is approximately equivalent to two consecutive pyrosequencing runs. On average, 10,000 randomly selected ESTs covered approximately 3 million non-redundant nucleotides of genome sequence and identified 5,540 unique loci. In comparison, a single pyrosequencing run identified 3 times as many genes and covered twice as much sequence.

Representation of chloroplast and mitochondrial transcripts
For 38 annotated mitochondrial ORFs at least one pyrosequencing hit was detected and with few exceptions for most of these also multiple Sanger ESTs exist. For 71 chloroplast ORFs, we found at least one pyrosequencing EST. Similar to mitochondrial ORFs, most chloroplast transcripts detected by pyrosequencing have previously been tagged by Sanger ESTs. Only a few pyrosequencing ESTs mapped to chloroplast or mitochondrial ribosomal RNAs (209 and 48, respectively), which indicates efficient removal of ribosomal RNA during oligo-dT purification of mRNA.
Pyrosequencing provides evidence for novel transcripts and transcript architecture 9,687 ESTs could be matched to the genome but did not match a predicted gene in TAIR7. Using BLASTX, these ESTs were searched against both the RefSeq protein database and the NCBI non-redundant protein database. 278 had significant protein matches against RefSeq and 545 had matches against nr (Table S1-B). After correction for those ESTs that aligned to more than one place on the genome, and multiple overlapping or adjacent ESTs, we identified approximately 60 locations in the genome that are represented by expressed sequences and are likely protein coding (based on hits to protein databases) but that were not annotated as genes in TAIR7 (Table S2). Because small peptides are under-represented in protein databases (Lease and Walker, 2006) and because many pyrosequencing ESTs represent non-coding 5' and 3'UTR's the tables likely underestimate the number of unannotated proteins expressed in Arabidopsis seedlings.
A specific example is shown in Figure 2. One hundred pyrosequencing ESTs and a number of Sanger ESTs map to the intergenic region between genes At1g65420 and At1g65430, a region of chromosome 1 that is not currently annotated as a gene. ). This putative gene thus represents a candidate for inclusion in a future version of the TAIR dataset. Further support for this gene comes from the fact that a related gene (Os08g0504500) is annotated in the genome of rice, encoding a protein that is to 66% identical with its Arabidopsis ortholog. No paralogs were found in the Arabidopsis genome, indicating it represents a single-copy gene.
A possible concern with pyrosequencing is contamination of cDNA with genomic DNA and hence the possibility that genomic DNA fragments are wrongly identified as transcribed sequences. However, Figure 3B shows that pyrosequencing ESTs mapping to At3g54830 are clearly reflecting (and thus verifying) the predicted exon/intron structure of this gene, hence they do represent processed transcripts, not genomic DNA. Visual examination (GBrowse) of ESTs mapping to over 100 genes supported this conclusion.
A specific example of novel transcript information is At3g11090, which is annotated as LOB-domain family protein. To date, no ESTs mapping to this gene have been identified.
However, 17 pyrosequencing ESTs unambiguously map to this locus, indicating this is indeed an expressed gene ( Figure 3A). Interestingly, 9 unique 17 bp signature sequences mapping to this gene have been previously retrieved by the Arabidopsis MPSSplus (Meyers et al., 2004;Nakano et al., 2006) project (http://mpss.udel.edu/at/GeneAnalysis.php?featureName=AT3G11090) and a basic expression level was detected by microarray analysis (Schmid et al., 2005). Given that expression of At3g11090 was detected by three independent experimental approaches, it is surprising that no ESTs for this gene were previously found. Since all three experimental approaches do not require cloning we posit that At3g11090 may be toxic or otherwise incompatible with cloning in E. coli and therefore not represented in the EST collection. Another specific example is the putative amino acid transporter At3g54830 ( Figure 3B). Also in this case

Application of pyrosequencing to analysis of gene expression: Digital northerns
Comparisons of the number of ESTs for a gene between different libraries or different genes in the same library can be a reliable indicator of relative gene expression provided the ESTs map unambiguously to a single gene location (Audic and Claverie, 1997 and extends to transcripts that represent less than 0.001% of the mRNA population. Of the 17,449 gene loci whose expression was detected, more than 10,000 were represented by at least 3 ESTs. We also compared the number of ESTs per locus to the microarray signal obtained with ATH1 arrays for aerial tissues of seedlings grown under similar conditions (Schmid et al., 2005).
The correlation coefficient for loci with more than 10 ESTs was 0.45. This correlation is similar to those observed in several other studies of SAGE versus microarray data (van Ruissen et al., 2005). The correlation coefficient did not increase when only genes with higher expression were compared, suggesting the lack of strong correlation is not due to the dynamic range but it is due to characteristic differences in the methods.

Assembly of pyrosequencing ESTs
In this study, pyrosequencing ESTs were mapped to a completely sequenced genome and their value for sequence annotation, gene discovery and transcript quantification is discussed. We also addressed the use of pyrosequencing for de novo sequencing of transcripts. To this end the pyrosequencing ESTs were assembled into contigs using three different tools, the Newbler assembler provided with the GS20 sequencer, CAP3 (Huang and Madan, 1999) and the stackPACK EST analysis pipeline (Miller et al., 1999) which uses d2_cluster  to partition (cluster) the ESTs and Phrap (http://www.phrap.org) to assemble each cluster.
Examples of the cluster results for several transcripts are shown in Figures 2-4. For all three methods relatively few full length cDNA sequences were reconstructed, even in cases where ESTs covering the entire predicted model were available. d2_cluster uses a transitive clustering algorithm based on similarity (96% in the default) over a large window (default 100 nt). While these parameters are appropriate for traditional ESTs they fail to adequately cluster ESTs generated by pyrosequencing as the overlapping regions of adjacent ESTs were too small to meet the threshold score for clustering. Reducing the window size while increasing the similarity did not significantly improve the clustering. CAP3 placed more ESTs in contigs than the other methods and created, on average, longer contigs than stackPACK, but still failed to produce full length contigs in the majority of instances where full coverage was possible given the available EST data. The Newbler assembler utilized the fewest ESTs of the methods tested and created the fewest contigs; however the average length of contigs assembled by Newbler was the longest.
Newbler generated significantly fewer short contigs than CAP3 or stackPACK. Additional examples of the assembly results with the three programs can be explored with the Genome Browser. This comparison indicated that although pyrosequencing is able to generate sufficient sequence data to completely represent the full length of many transcripts, the assembly programs we tested are unable to efficiently create full-length contigs.

DISCUSSION
The results presented above indicate that pyrosequencing provides a very rapid, low-cost  (Velculescu et al., 1995) and MPSS (Brenner et al., 2000) have in the past provided key information on transcripts, because of the much shorter sequences and other limitations of these techniques it is likely their use will decline for profiling of transcriptomes.
A single pyrosequencing run identified most of the genes expressed in eight-day-old Arabidopsis seedlings. Although performing a second run only increased the number of transcripts detected by 10%, the total unique sequence information increased 50%. This occurred because the additional ESTs yielded more comprehensive sequence coverage across the length of transcripts, particularly for those transcripts of genes with low expression levels. An additional benefit of multiple runs is derived from the increase in statistical accuracy available when using EST numbers to make comparisons of relative gene expression levels.
For Arabidopsis, well-characterized and widely used microarrays are available that represent a large proportion of the expressed genes. The cost of a pyrosequencing run is several fold higher than a microarray experiment and therefore pyrosequencing will in most cases not be the tool of choice for routine transcript analysis of Arabidopsis. However, pyrosequencing does have the advantage of providing data for the approximately 25% of Arabidopsis genes that are not currently represented or not accurately discriminated on available microarrays.
A recent study of Bainbridge et al. (2006) detected transcripts for 10,000 loci from a human prostate cancer cell line using a single pyrosequencing cycle that resulted in 181,279 ESTs (Bainbridge et al., 2006). This study also reported a bias toward representation of 5' and 3' ends and to the middle of transcripts. It was speculated that these biases resulted from the accessibility of ends to sequencing and from incomplete fragmentation of the cDNA. In our analyses we observed a higher number of ESTs from the 5' ends of all transcripts. In addition the 3' ends of long (>2000 nt) transcripts were more highly represented as would be expected from cDNA synthesis primed with oligo dT. However, plots of the distribution of ESTs across the length of the transcript indicated all regions were well represented. Greater representation of the middle of transcripts was primarily notable for short cDNAs. There was only a slight strand bias with 55 -60% of ESTs coming from the plus strand. We conclude that fragmentation of the cDNA's during nebulization did not introduce major bias in the representation of transcripts.
Therefore, other methods for preparation of cDNA libraries, such as random priming do not seem to be needed to provide complete coverage across the length of transcripts.
Approximately 3.5 % of the ESTs from our study that matched the Arabidopsis genome did not match ESTs already available in GenBank. In contrast, Emrich et al. (2007) recently reported that 30% of ESTs obtained by a single pyrosequencing run for maize shoot apical meristem did not align to any of ~680,000 maize ESTs available in GenBank. This much higher proportion could reflect the specialized cell type that was sampled, or perhaps the greater complexity of the maize genome.
Efficient reconstruction of longer sequence contigs from pyrosequencing ESTs requires a high degree of oversampling and unbiased representation of sequence fragments. This, in contrast to genomic sequencing, is inherently problematic with transcriptome sequencing because of the large dynamic range of gene expression levels that leads to massive redundancy for coverage of some highly expressed genes, whereas transcripts of genes with baseline expression levels are underrepresented. In our study, we found that 26% of all EST obtained from eight-day-old Arabidopsis seedlings were derived from only 25 highly expressed genes that are members of the Rubisco and LHC gene families, whereas over 5,000 genes were represented by less than 10 ESTs. If priority is on gene discovery and assembly of longer contigs rather than on assessing relative gene expression, it will likely be useful to normalize the cDNA population prior to sequencing to maximize coverage of less abundant transcripts present in the sample. In this regard, Cheung et al. (2006) performed a single sequencing run on a normalized cDNA population derived from mixed tissues of Medicago trunculata. Their sequencing yielded 23 MB of unique sequences which is approximately twice the amount of unique sequence information we obtained (10.3 MB) from two runs with a non-normalized library.
Our study also revealed that currently available software tools have problems with assembly of the very large numbers of short sequences provided by pyrosequencing. This was the case even for those abundant transcripts where thousands of ESTs could be aligned to provide essentially complete coverage. The inability to assemble contigs is thus in large part related to the short overlaps. Improvements in software are currently under development and will be particularly important for the application of pyrosequencing to transcripts from species without extensive genome information. The increase in sequence length to >200 nt expected from pyrosequencing instrument upgrades will also greatly facilitate assembly of full-length cDNA sequences.
The availability of very comprehensive data for the Arabidopsis genome and a large set of conventional ESTs provided a baseline for this evaluation of pyrosequencing data. A much greater advantage of pyrosequencing will be its application to EST sequencing for those species for which little or no genomic data are available. The ability to rapidly detect sequences for almost all genes expressed in a sample will provide a more comprehensive tool for gene discovery than conventional EST sequencing. For example, genes involved in natural product biosynthesis have frequently been discovered first by EST sequencing (e.g., Bao et al., 2002).
The lower cost and greater sequence coverage afforded by pyrosequencing will make it possible to more confidently identify candidate genes involved in biosynthetic pathways and will allow identification of genes with very low expression levels often missed by conventional EST projects. Finally, as more plant genome sequences become available, mapping of pyrosequencing ESTs to these genomic sequences will provide a particularly efficient means for experimental verification of predicted gene models and can also be used to train ab initio gene prediction programs.

Applications to proteomics
Currently, proteomic analysis of organisms lacking a fully sequenced genome is difficult.
This is due to the way modern proteomics data is analyzed using uninterpreted spectral assignments. This approach calculates an ideal mass spectrum for each peptide in a database and compares such spectra against observed spectra. This approach is fast enough to allow for the analysis of the thousands of spectra collected for a typical complex protein sample and thus makes the procedure amenable to high throughput analysis (Tabb et al., 2003;Hirano et al., 2004;van Wijk, 2004). The determination of the peptide sequence from the collected spectra (i.e., de novo sequencing) is generally considered too slow and error prone to be practical for large numbers of proteins (Baginsky and Gruissem, 2006;Pevtsov et al., 2006). Unfortunately, organisms and tissues that are very amenable to biochemistry and protein isolation are frequently not model species. The potential of, for example peas, for organelle proteomics is unexplored since the sequence information for pea is severely limited. Pyrosequencing technology allows researches to build custom sequence libraries for their organism and tissue of interest. Since the success of a proteomics project largely depends on the size and quality of the available sequence database, the lower cost and speed of obtaining such EST data using pyrosequencing will expand the number of organisms for which this condition can be met. For proteomics approaches, however, it will important to obtain longer EST contigs assembled from multiple reads to minimize the rate of false positive peptide identifications. To this end, either higher sequence coverage is required or supervised methods for contig assembly based on existing genome scaffolds need to be implemented.

Preparation of RNA and cDNA of Arabidopsis seedlings
Arabidopsis (Col) seeds were sown on soil mix, placed at 4ºC for 2 days and then germinated under continuous light (~150 umol/sec m 2 ) at 20ºC. After 8 days the above ground green tissue was harvested and immediately frozen in liquid nitrogen. Total RNA was extracted by grinding the frozen tissue with a mortar and pestle in the "single-step" acid guanidinium thiocyanatephenol-chloroform mixture as described by Chomczynski and Sacchi (1987) followed by two consecutive washes of the RNA-pellet with 3M sodium acetate (pH 6.0) as described by Logemann et al. (1987) to remove polysaccharides. Total RNA was checked for purity and degradation using the Agilent 2100 Bioanalyzer RNA chip (Agilent Technologies Inc., Santa Clara, CA) and stored in 80% ethanol.
mRNA was purified using the Illustra mRNA purification kit (GE Healthcare, Piscataway, NJ) . One mg of total RNA was re-dissolved in TE buffer and applied to a pre-equilibrated oligo (dT)-cellulose column. The poly(A) + RNA was eluted from the column and applied to a second column for another round of purification. After elution from the column the poly(A)+ RNA was stored as ethanol precipitate.
cDNA was synthesized using the Clontech Smart PCR cDNA Synthesis Kit (Clontech, Mountain View, CA). First strand cDNA synthesis was performed with oligo-dT primer in a total volume of 10µl as described in the provided protocol using 1µg mRNA. Double strand cDNA was prepared from 2µl of the first strand reaction by PCR (13 cycles) with provided primers in a 100µl reaction. The cDNA was purified using Qiagen QIAquick PCR purification spin columns (Qiagen, Valencia, CA) and was checked for purity and degradation using the Agilent 2100 Bioanalyzer DNA chip.
Gene model and EST mapping data were displayed with the Generic Genome Browser (GBrowse) developed by Lincoln Stein (Stein et al, 2002) and are available at: http://genomics.msu.edu/cgi-bin/gbrowse/A_thaliana .