Abstract

Summary: Errors are prevalent in cDNA sequences but the extent to which sequence collections differ in frequencies and types of errors has not been investigated systematically. cDNA quality control, or cQC, was developed to evaluate the quality of cDNA sequence collections and to revise those sequences that differ from a higher quality genomic sequence. After removing rRNA, vector, bacterial insertion sequence and chimeric cDNA contaminants, small-scale nucleotide discrepancies were found in 51% of cDNA sequences from one Arabidopsis cDNA collection, 89% from a second Arabidopsis collection and 75% from a rice collection. These errors created premature termination codons in 4 and 42% of cDNA sequences in the respective Arabidopsis collections and in 7% of the rice cDNA sequences.

Availability: A web-based version of cQC, source code and revised cDNA collections are available at

Contact: raj@ag.arizona.edu

Supplementary information: Further text, tables and figures are available at the above website or on Bioinformatics online.

INTRODUCTION

Expressed sequence tags (ESTs) and full-length cDNA sequences have been used to confirm or revise computational annotations of genomic sequences; however, the quality of the original cDNA sequence datasets has not been investigated. Because ESTs and full-length cDNAs are generated by single-pass sequencing, errors are frequent. Substitutions, deletions and insertions can alter reading frames or introduce premature termination codons (PTCs). In addition, bacterial insertion sequences (ISs) (Hill et al., 2000), ribosomal RNA (Gonzalez and Sylvester, 1997) and chimeric cDNA sequences (Burke et al., 1998) can contaminate cDNA libraries. Thus, we developed cDNA quality control (cQC) to evaluate the quality of cDNA sequences and to provide corrected cDNA sequences.

PROGRAM OVERVIEW

cQC is a program written in Perl which

  • removes chimeric cDNAs and identifies cDNAs with similarity to rRNA sequences,

  • identifies cDNA sequences with similarity to bacterial IS elements for manual removal,

  • identifies cDNAs lacking sequence similarity to the genomic sequence,

  • identifies small-scale discrepancies (substitutions, insertions and deletions) in remaining cDNA sequences, calculates the number of occurrences in the 5′-untranslated region (5′-UTR), major open reading frame (ORF) and 3′-untranslated region (3′-UTR) and corrects the cDNA sequence to match the genomic sequence,

  • assesses whether these discrepancies create a frame shift or introduce PTCs in the ORF, and

  • generates a set of corrected cDNA sequences based on genomic sequence.
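The per-region tally in the fourth step can be illustrated with a minimal sketch. It assumes that discrepancy positions and ORF boundaries are already known in 0-based cDNA coordinates; the function names and the half-open ORF interval are illustrative choices, not part of cQC's Perl code.

```python
from collections import Counter

def region_of(pos, orf_start, orf_end):
    """Assign a 0-based cDNA position to a region.

    The ORF is taken to span the half-open interval [orf_start, orf_end);
    this coordinate convention is an assumption made for the sketch.
    """
    if pos < orf_start:
        return "5'-UTR"
    if pos < orf_end:
        return "ORF"
    return "3'-UTR"

def tally_by_region(discrepancy_positions, orf_start, orf_end):
    """Count discrepancies falling in the 5'-UTR, ORF and 3'-UTR."""
    return Counter(region_of(p, orf_start, orf_end)
                   for p in discrepancy_positions)

# Hypothetical cDNA with an ORF at positions 10..39
counts = tally_by_region([0, 5, 10, 11, 40], orf_start=10, orf_end=40)
print(dict(counts))
```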

Arabidopsis thaliana (Arabidopsis) and Oryza sativa (rice) are good targets for analysis by cQC because high-quality genomic sequences and large libraries of full-length cDNA sequences are available. Genomic and cDNA sequences have been derived from the same highly inbred lines, minimizing the occurrence of allelic variants and therefore the number of incorrectly labeled discrepancies. Two Arabidopsis (Seki et al., 2002; Castelli et al., 2004) and one rice (Rice Full-length cDNA Consortium, 2003) full-length cDNA collections were analyzed by cQC.

Prevalence of cDNAs misaligning or not aligning to genomic sequence

After identifying cDNAs containing IS elements and rDNA sequences (Supplementary text and table S4), cQC compares cDNA sequences with genomic sequence using MegaBLAST (Zhang et al., 2000) and identifies clusters of exons by grouping high-scoring pairs that are proximal to each other within genomic sequence. This yields three classes (Supplementary figure S1): (1) normal cDNAs, corresponding to a single cluster, (2) chimeric cDNAs, corresponding to two or more unrelated clusters and (3) cDNAs with no genomic counterpart. The rice collection contained the highest proportion of sequences lacking a genomic counterpart (Table 1), probably due to the 78 gaps in its genome sequence (Yuan et al., 2005). Chimeric cDNAs occurred at a much higher frequency (∼1% of cDNAs) in the rice collection than in the Arabidopsis collections (Table 1), similar to what was found for rDNA-containing cDNAs, many of which are chimeric cDNA clones.
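The grouping of high-scoring pairs into clusters can be sketched as follows. The distance threshold `max_gap`, the tuple layout of a hit and the function names are assumptions made for illustration; the text does not state how cQC defines proximity, and counting clusters is a simplification of deciding whether clusters are "unrelated".

```python
from collections import defaultdict

def cluster_hits(hits, max_gap=10_000):
    """Group hits on the same chromosome and strand whose genomic start
    lies within max_gap bp of the end of the current cluster.

    hits: iterable of (chrom, strand, g_start, g_end) tuples.
    max_gap is a hypothetical threshold, not a documented cQC parameter.
    """
    by_locus = defaultdict(list)
    for chrom, strand, g_start, g_end in hits:
        by_locus[(chrom, strand)].append((g_start, g_end))

    clusters = []
    for (chrom, strand), spans in sorted(by_locus.items()):
        spans.sort()
        current = [spans[0]]
        cluster_end = spans[0][1]
        for start, end in spans[1:]:
            if start - cluster_end <= max_gap:
                # Close enough to the running cluster: extend it
                current.append((start, end))
                cluster_end = max(cluster_end, end)
            else:
                # Too far away: close the cluster and start a new one
                clusters.append((chrom, strand, current))
                current = [(start, end)]
                cluster_end = end
        clusters.append((chrom, strand, current))
    return clusters

def classify_cdna(hits, max_gap=10_000):
    """Label a cDNA by the number of genomic clusters its hits form."""
    n_clusters = len(cluster_hits(hits, max_gap))
    if n_clusters == 0:
        return "no genomic counterpart"
    return "normal" if n_clusters == 1 else "chimeric"
```

Under these assumptions, two hits a few hundred bases apart on one strand form a single cluster, whereas hits on different chromosomes, or far apart on the same chromosome, form separate clusters and the cDNA is flagged as chimeric.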

Table 1

Frequency of cDNAs misaligning to genomic sequence and consequences for protein prediction

cDNA collection | No genomic counterpart, % initial set (n) | Chimeric cDNAs, % initial set (n) | Lacking similarity internal to cDNA, % initial set (n) | No similarity at termini, % initial set (n) | cDNAs with small-scale discrepancies, % final set (n) | Total PTCs and/or frameshifts, % final set (n)
Arabidopsis-R(a) | 0.1 (6) | 0.3 (36) | 0.1 (8) | 0.6 (79) | 50.7 (6620) | 5.4 (705)
Arabidopsis-G | 2.5 (540) | 0.0 (0) | 3.3 (711) | 1.1 (232) | 89.3 (17 881) | 46.3 (9267)
Rice | 5.0 (1412) | 0.8 (229) | 0.4 (117) | 1.0 (275) | 75.3 (19 748) | 8.7 (2281)

(a) R denotes RIKEN sequences and G denotes Genoscope sequences.

Next, sim4 (Florea et al., 1998) alignments of cDNA to genomic sequence distinguish cDNAs (1) lacking genomic sequence similarity at one or both ends, (2) lacking sequence similarity internally and (3) possessing continuous similarity. cQC removes the second category, trims the misaligning terminal sequences from cDNAs in the first category and appends the trimmed cDNAs to a cleaned sequences file. cDNAs with aberrant termini or internal irregularities represented 0.7–4.4% of the tested collections (Table 1). In total, cDNA sequences misaligning or not aligning to genomic sequence comprised 1–8% of the sequences in these cDNA collections (Supplementary table S5).
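The three alignment categories can be illustrated with a short sketch that works on the cDNA intervals covered by an alignment. It assumes the aligned blocks have already been parsed into 1-based inclusive cDNA coordinates; the `min_unaligned` cutoff and the function names are hypothetical, and checking the internal category first is a choice made here because that category is discarded.

```python
def classify_alignment(blocks, cdna_len, min_unaligned=1):
    """Classify a cDNA by where it fails to align to the genome.

    blocks: (c_start, c_end) cDNA intervals, 1-based inclusive, that
            align to genomic sequence (for example, one per exon).
    min_unaligned: hypothetical minimum number of unaligned bases that
            counts as a lack of similarity.
    """
    if not blocks:
        raise ValueError("no aligned blocks")
    blocks = sorted(blocks)

    # Unaligned cDNA bases between consecutive aligned blocks
    for (_, prev_end), (next_start, _) in zip(blocks, blocks[1:]):
        if next_start - prev_end - 1 >= min_unaligned:
            return "internal"

    # Unaligned cDNA bases before the first or after the last block
    head = blocks[0][0] - 1
    tail = cdna_len - blocks[-1][1]
    if head >= min_unaligned or tail >= min_unaligned:
        return "terminal"
    return "continuous"

def trim_termini(seq, blocks):
    """Keep only the cDNA span from the first to the last aligned base."""
    blocks = sorted(blocks)
    return seq[blocks[0][0] - 1 : blocks[-1][1]]
```

For instance, a 12-base cDNA whose aligned blocks cover positions 4 to 9 would be classed as "terminal" and trimmed to those six bases.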

Small-scale discrepancies: substitutions, deletions and insertions

Finally, cQC identifies small-scale discrepancies between each cDNA and the corresponding genomic sequence, reports the discrepancies in an altered sequences file, changes the cDNA sequence to match its genomic counterpart and adds the corrected sequences to the cleaned sequences file along with cDNAs showing perfect alignments.
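A minimal sketch of this correction step is given below. It assumes a gapped pairwise alignment of the exonic cDNA and genomic sequence is already available as two equal-length strings with `-` marking gaps; it counts each gap column as one event rather than merging runs, and the function name is illustrative rather than taken from cQC.

```python
def correct_cdna(aligned_cdna, aligned_genomic):
    """Rewrite a cDNA to match the genome and tally the discrepancies.

    Both arguments are rows of one gapped alignment ('-' marks a gap).
    Returns (corrected_sequence, counts).
    """
    if len(aligned_cdna) != len(aligned_genomic):
        raise ValueError("alignment rows must have equal length")

    counts = {"substitution": 0, "insertion": 0, "deletion": 0}
    corrected = []
    for c, g in zip(aligned_cdna.upper(), aligned_genomic.upper()):
        if c == g:
            if g != "-":
                corrected.append(g)      # matching base, kept as is
        elif g == "-":
            counts["insertion"] += 1     # extra base in cDNA, dropped
        elif c == "-":
            counts["deletion"] += 1      # base missing from cDNA, restored
            corrected.append(g)
        else:
            counts["substitution"] += 1  # mismatch, genomic base used
            corrected.append(g)
    return "".join(corrected), counts

# Hypothetical alignment with one deletion and one insertion in the cDNA
seq, counts = correct_cdna("ACGT-ACCT", "ACGTTAC-T")
print(seq, counts)
```

By construction, the corrected sequence equals the genomic row with gaps removed, which is what "corrected to match the genomic sequence" amounts to in this simplified setting.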

Discrepancies were found in 51–89% of intact cDNAs (Table 1). Locations of discrepancies varied within cDNAs, and collections differed with respect to the frequency of discrepancies in the ORF as compared with the 5′- and 3′-UTRs (Supplementary table S6). In total, Genoscope sequences have a ∼10-fold greater discrepancy rate than RIKEN sequences (Supplementary table S7). The effect of these changes on protein prediction was most marked in the Genoscope collection, in which 46% of cDNAs carried frameshift mutations and/or PTCs, whereas the frequencies in the other collections were more moderate, ranging from 5 to 9% (Table 1).
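How a discrepancy might be scored as a frameshift or a PTC can be sketched as follows. The sketch assumes the ORF region of the uncorrected cDNA and of the genome-corrected cDNA have both been extracted, each starting at the start codon, with the corrected ORF ending in its stop codon. It detects only a net frameshift (a length difference not divisible by three), so compensating indels are missed; the function names are hypothetical and this is not necessarily how cQC makes the call.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def first_inframe_stop(seq, start=0):
    """Return the 0-based index of the first in-frame stop codon at or
    after `start`, or None if no complete stop codon is found."""
    seq = seq.upper()
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None

def assess_orf(original_orf, corrected_orf):
    """Compare the uncorrected ORF with the genome-corrected ORF.

    frameshift: net length change is not a multiple of three.
    ptc: the uncorrected ORF hits an in-frame stop before the position
         of the stop codon in the corrected ORF.
    """
    frameshift = (len(original_orf) - len(corrected_orf)) % 3 != 0
    expected_stop = len(corrected_orf) - 3
    stop = first_inframe_stop(original_orf)
    ptc = stop is not None and stop < expected_stop
    return {"frameshift": frameshift, "ptc": ptc}

corrected = "ATGAAACCCTAA"                   # hypothetical corrected ORF
print(assess_orf("ATGTAACCCTAA", corrected))  # substitution creating a stop
print(assess_orf("ATGAACCCTAA", corrected))   # single-base deletion
```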

Clearly, the quality of full-length cDNA sequence collections can be quite variable. If this important resource is to be used effectively as a primary source of data, information about the quality of these collections will be valuable and corrected sequence collections must be available. cQC can provide this for any species for which there is a high-quality genome sequence, including highly redundant (e.g. 10×) draft genome sequences.

We thank the Biotechnology Computing Facility (BCF) for hosting the cQC website. This research was supported by a University of Arizona NSF IGERT Genomics Initiative fellowship to C.A.H. and T.J.W. Funding to pay the Open Access publication charges for this article was provided by the University of Arizona.

Conflict of Interest: none declared.

REFERENCES

Burke J., et al. Alternative gene form discovery and candidate gene selection from gene indexing projects, Genome Res., 1998, vol. 8 (pg. 276-290)
Castelli V., et al. Whole genome sequence comparisons and ‘full-length’ cDNA sequences: a combined approach to evaluate and improve Arabidopsis genome annotation, Genome Res., 2004, vol. 14 (pg. 406-413)
Florea L., et al. A computer program for aligning a cDNA sequence with a genomic DNA sequence, Genome Res., 1998, vol. 8 (pg. 967-974)
Gonzalez I.L., Sylvester J.E. Incognito rRNA and rDNA in databases and libraries, Genome Res., 1997, vol. 7 (pg. 65-70)
Hill F., et al. An estimate of large-scale sequencing accuracy, EMBO Reports, 2000, vol. 1 (pg. 29-31)
Rice Full-length cDNA Consortium. Collection, mapping and annotation of over 28 000 cDNA clones from japonica rice, Science, 2003, vol. 301 (pg. 376-379)
Seki M., et al. Functional annotation of a full-length Arabidopsis cDNA collection, Science, 2002, vol. 296 (pg. 141-145)
Yuan Q., et al. The Institute for Genomic Research Osa1 rice genome annotation database, Plant Physiol., 2005, vol. 138 (pg. 18-26)
Zhang Z., et al. A greedy algorithm for aligning DNA sequences, J. Comput. Biol., 2000, vol. 7 (pg. 203-214)

The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions@oxfordjournals.org
