Abstract

Expressed sequence tag (EST) sequences can provide a wealth of data for phylogenetic and genomic studies, but the utility of these resources is restricted by poor taxonomic sampling. Here, we use small EST libraries (<1,000 clones) to generate phylogenetic markers across a broad sample of insects, focusing on the species-rich Coleoptera (beetles). We sequenced over 23,000 ESTs from 34 taxa, which produced 8,728 unique sequences after clustering nonredundant sequences. Between taxa, the sequences could be grouped into 731 gene clusters, with the largest corresponding to mitochondrial DNA transcripts and gene families chymotrypsin, actin, troponin, and tubulin. While levels of paralogy were high in most gene clusters, several midsized clusters including many ribosomal protein (RP) genes appeared to be free of expressed paralogs. To evaluate the utility of EST data for molecular systematics, we curated available transcripts for 66 RP genes from representatives of the major groups of Coleoptera. Using supertree and supermatrix approaches for phylogenetic analysis, the results were consistent with the emerging phylogenetic conclusions about basal relationships in Coleoptera. Numerous small EST libraries from a taxonomically densely sampled lineage can provide a core set of genes that together act as a scaffold in phylogenetic reconstruction, comparative genomics, and studies of gene evolution.

Introduction

Current molecular systematics depends on polymerase chain reaction (PCR) amplification of a few “universal” genes to provide phylogenetic data. However, as the need for sequencing further genes is increasingly evident (Murphy et al. 2001; Wheeler et al. 2001; Philippe et al. 2004; Teeling et al. 2005), expanding PCR approaches to a wider selection of genes becomes difficult because of the need to develop new degenerate primers for the amplification of single-copy loci. While the growing number of genome sequences may eventually be used for phylogenetic inferences across a broad sample of taxa, expressed sequence tags (ESTs) provide a more immediately available source of genomic data (Rudd 2003).

Most publicly available ESTs have been generated for gene discovery or to complement genome sequencing efforts. Some ESTs have been compiled into sets of nonredundant clusters in public databases such as TIGR (http://www.tigr.org/), PartiGeneDB (http://www.partigenedb.org/), and UniGene (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unigene). However, species for EST analyses have rarely been selected based on taxonomic criteria, which limits their use for phylogenetic analyses and comparative genomics (but see the recent study of Parkinson et al. 2004b). A concerted effort to enlarge EST databases to encompass disparate taxa should alleviate these problems (Bapteste et al. 2002; Theodorides et al. 2002), and recent compilations of large multigene data sets combined from genome sequences and EST data have demonstrated the power of molecular sequences for resolving deep relationships in eukaryotes (Philippe, Lartillot, and Brinkmann 2005; Rodriguez-Ezpeleta et al. 2005). Here we explore the possibility of generating small EST databases for taxa specifically selected to obtain comprehensive coverage of target groups. We apply this approach to the Coleoptera (beetles), a group that includes nearly one-third of all known species of animals (Erwin 1982; Hammond 1992; Beutel and Haas 2000; Caterino et al. 2002) but where existing EST data are limited.

A critical problem for comparative studies is that ESTs from different taxa may not contain overlapping sets of genes. For example, given a conserved core of 6,089 orthologous genes in the genomes of Drosophila melanogaster and Anopheles gambiae (Zdobnov et al. 2002), the probability that a given ortholog is retrieved among 250 ESTs from each species is only 1.68 × 10^−3 (250/6,089 × 250/6,089) if all genes are equally represented. The challenge of matching orthologous genes between taxa is amplified by the low expression of many transcripts; sequencing of tens of thousands of ESTs in D. melanogaster (Rubin et al. 2000) or Bombyx mori (Mita et al. 2003) fell short of reaching a full complement of predicted genes. However, even relatively small EST data sets consistently recover a subset of genes with conserved roles in core biological processes such as DNA replication, transcription, and cell metabolism (Hsiao et al. 2001). These genes should be suitable for phylogenetic analysis across a broad sample of taxa.
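
The arithmetic behind this estimate can be reproduced in a few lines. This is a minimal sketch under the same simplifying assumption as the text (all genes equally represented), with the further assumption that each EST in a library samples a distinct gene. The function name `joint_hit_probability` is illustrative and does not come from the study.

```python
def joint_hit_probability(n_est_a: int, n_est_b: int, n_orthologs: int) -> float:
    """Chance that one given ortholog is sampled in both libraries.

    Assumes every gene is equally represented and that each EST in a
    library corresponds to a different gene (an idealization).
    """
    return (n_est_a / n_orthologs) * (n_est_b / n_orthologs)


# Values quoted in the text: 250 ESTs per species, 6,089 shared orthologs.
p = joint_hit_probability(250, 250, 6089)
print(f"{p:.5f}")
```

Because real libraries are dominated by highly expressed transcripts, this uniform model probably understates how often housekeeping genes overlap between libraries.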

The use of nuclear genes as a source of phylogenetic data requires an appreciation of the complex nature of genome evolution, involving gene loss, duplications, expansion of gene families, and functional diversification. Assignment of gene orthology is difficult even between fairly closely related groups such as the dipteran A. gambiae and D. melanogaster, where genes diversified independently in each lineage (Zdobnov et al. 2002). Increased taxon sampling can improve the confidence of orthology assignments by identifying the origin of gene copies, facilitating inferences on gene duplications, and clarifying the relationship between gene content and the diversity of lineages (Parkinson et al. 2004b).

Here, we test the utility of dense taxonomic EST sampling, generating relatively small numbers of ESTs (<1,000 clones) for each major group in the focal Coleoptera and several related groups of insects. Studies of basal relationships in the Coleoptera to date have been based on the mitochondrial cox1 (Howland and Hewitt 1995) and the nuclear small subunit rRNA genes (Caterino et al. 2002), but a single locus was insufficient in each case to resolve the main phylogenetic questions. Novel sources of phylogenetic information are highly desirable and should preferentially rely on multiple single-copy nuclear genes. Using EST-based approaches that do not rely on degenerate PCR would be a great advantage in this diverse group of insects. We therefore used the Coleoptera to test critical questions about the feasibility of dense EST sampling for molecular systematics. Specifically, we investigated the minimum size of EST libraries necessary to produce sufficient overlap in gene representation between libraries and assessed what kind of genes show the widest representation across small EST libraries. Further, the degree of paralogy in EST data remains insufficiently known but is a critical issue if genes from different species libraries are used for phylogenetic reconstruction. The utility of the approach is shown here by producing phylogenetic trees for the basal groups of Coleoptera from 66 genes coding for ribosomal proteins (RP).

Materials and Methods

Insect Specimens, RNA Extraction, and cDNA Library Construction

Twenty-five species of insects, of which 14 were Coleoptera, and two outgroups were used for library construction (table 1). RNA was obtained from entire adult specimens, except for the use of larval wing discs in the butterfly Papilio dardanus (A. Cieslak and A. P. Vogler, unpublished data) and testes in the tiger beetles Cicindela litorea and Cicindela littoralis (J. Galian and A. P. Vogler, unpublished data). Seven published coleopteran EST libraries (Theodorides et al. 2002) were also included in the analysis. Molecular procedures followed Theodorides et al. (2002) using the SMART method (Clontech Laboratories, Mountain View, Calif.) and cloning of cDNA with the Topo TA cloning kit (Invitrogen, Carlsbad, Calif.). In total, over 31,000 clones were screened, and all plasmid inserts >600 bp were sequenced using BigDye technology on an ABI 3700 automated sequencer.

Table 1

List of Species, Their Taxonomy, and GenBank Accession Numbers for the ESTs

| Class: Order / Suborder: Series: Family | Species | Accession Number^a | Automated^b Contigs | Automated^b Singletons | Automated^b Sequences | Manual Contigs | Manual Singletons | Manual Sequences |
|---|---|---|---|---|---|---|---|---|
| Insecta: Coleoptera | | | | | | | | |
| Archostemata: Micromalthidae | Micromalthus debilis | CV155742–CV155959 | 24 | 108 | 132 | 26 | 124 | 150 |
| Myxophaga: Sphaeriusidae | Sphaerius sp. | CV155960–CV156656 | 159 | 181 | 340 | 193 | 165 | 358 |
| Adephaga: Carabidae | Carabus granulatus | BQ474802–BQ475107 | 77 | 90 | 167 | 72 | 89 | 161 |
| Adephaga: Cicindelidae | Cicindela campestris | BQ475108–BG475778 | 301 | 64 | 365 | 278 | 58 | 336 |
| | Cicindela litorea | CV156657–CV157115 | 150 | 72 | 222 | 158 | 63 | 221 |
| | Cicindela littoralis | CV157116–CV157483 | 86 | 106 | 192 | 106 | 107 | 213 |
| Adephaga: Dytiscidae | Meladema coriacea | BQ476741–BQ477288 | 123 | 166 | 289 | 122 | 164 | 286 |
| Pol.: Staphyliniformia: Georissidae | Georissus sp. | CV157484–CV158376 | 224 | 161 | 385 | 258 | 133 | 391 |
| Pol.: Staphyliniformia: Silphidae | Silpha atrata | CV158377–CV158395 | | | 14 | | 10 | 17 |
| Pol.: Staphyliniformia: Histeridae | Hister sp. | CV158396–CV159219 | 185 | 141 | 326 | 192 | 130 | 322 |
| Pol.: Scarabaeiformia: Scarabaeidae | Scarabaeus laticollis | CV159220–CV160155 | 261 | 119 | 380 | 226 | 75 | 301 |
| Pol.: Elateriformia: Elateridae | Agriotes lineatus | CV160156–CV160927 | 171 | 208 | 379 | 203 | 198 | 401 |
| Pol.: Elateriformia: Buprestidae | Julodis onopordi | CV152433–CV153501 | 291 | 96 | 387 | 262 | 65 | 327 |
| Pol.: Elateriformia: Eucinetidae | Eucinetus sp. | CV153502–CV154310 | 179 | 150 | 329 | 203 | 114 | 317 |
| Pol.: Elateriformia: Dascillidae | Dascillus cervinus | CV154311–CV154939 | 194 | 135 | 329 | 197 | 128 | 325 |
| Pol.: Cucujiformia: Bipyllidae | Biphyllus lunatus | BQ474131–BQ474801 | 186 | 63 | 249 | 185 | 49 | 234 |
| Pol.: Cucujiformia: Mycetophagidae | Mycetophagus quadripustulatus | CV154940–CV155674 | 193 | 188 | 381 | 191 | 210 | 401 |
| Pol.: Cucujiformia: Tenebrionidae | Tribolium confusum | CV155675–CV155741 | | 54 | 58 | | 59 | 64 |
| Pol.: Cucujiformia: Chrysomelidae | Timarcha balearica | AJ537611–AJ538039 | 55 | 210 | 265 | 170 | 97 | 267 |
| Pol.: Cucujiformia: Curculionidae | Curculio glandium | BQ476162–BQ476740 | 142 | 86 | 228 | 162 | 60 | 222 |
| Pol.: Cucujiformia: Anthribidae | Platystomos albinus | BQ476142–BQ476161 | 108 | 34 | 142 | 99 | 31 | 130 |
| Insecta: Lepidoptera | | | | | | | | |
| Noctuidae | Euclidea glyphica | CV174082–CV174651 | 186 | 80 | 266 | 197 | 66 | 263 |
| Papilionidae | Papilio dardanus | CV174652–CV175351 | 163 | 243 | 406 | 219 | 115 | 334 |
| Insecta: Strepsiptera | | | | | | | | |
| Mengenillidae | Mengenilla chobauti | CD485368–CD485367 | 51 | 280 | 331 | 57 | 288 | 345 |
| Mengenillidae | Eoxenos laboulbenei | CD492361–CD492706 | 54 | 321 | 375 | 54 | 335 | 389 |
| Insecta: Raphidiodea | | | | | | | | |
| Raphidiidae | Phaeostigma major | CV176478–CV176535 | | 51 | 54 | | 47 | 55 |
| Insecta: Trichoptera | | | | | | | | |
| Limnephilidae | Limnephilus flavicornis | CV176536–CV176696 | 23 | 100 | 123 | 25 | 95 | 120 |
| Insecta: Mecoptera | | | | | | | | |
| Panorpidae | Panorpa cf. vulgaris | CV176697–CV177401 | 240 | 100 | 340 | 246 | 72 | 318 |
| Insecta: Orthoptera | | | | | | | | |
| Gryllidae | Gryllus bimaculatus | CV175352–CV175963 | 223 | 90 | 313 | 238 | 72 | 310 |
| Insecta: Dictyoptera | | | | | | | | |
| Mantidae | Sphodromantis centralis | CV175964–CV176136 | 14 | 86 | 100 | 17 | 94 | 111 |
| Insecta: Hemiptera | | | | | | | | |
| Aleyrodidae | Aleurothrixus sp | CV176137–CV176477 | 67 | 165 | 232 | 59 | 173 | 232 |
| Insecta: Thysanura | | | | | | | | |
| Lepismatidae | Lepisma aurea | CV177402–CV177826 | 63 | 273 | 336 | 65 | 275 | 340 |
| Outgroups | | | | | | | | |
| Arachnida: Araneae: Dysderidae | Dysdera erythrina | CV177827–CV178552 | 197 | 65 | 262 | 213 | 44 | 257 |
| Diplopoda: Julida | Julida sp. | CV178553–CV179005 | 122 | 91 | 213 | 142 | 68 | 210 |
| Total | 34 | | 4,524 | 4,386 | 8,910 | 4,855 | 3,873 | 8,728 |

^a Mitochondrial sequences have been removed from the submitted sequences but have been used in all analyses.

^b Number of TUGs, singletons, and unique sequences after EST vector trimming and sequence quality control for the automated and manual approaches.

For most libraries, ESTs were sequenced in both directions to provide longer and more accurate sequences, which is critical for phylogenetic analysis. Sequencher 4.1 (Gene Codes Corp., Ann Arbor, Mich.) was used for sequence editing, including the automated removal of vector sequences and poor-quality data. Sequences were further edited manually to recall ambiguities and resolve conflicting base calls in forward and reverse reads where multiple clones were available. Edited sequences were clustered into contigs in Sequencher at high stringency to obtain “tentative unique genes” (TUGs) for each library and exported for further analysis. We also used a fully automated method for sequence editing with the Trace2dbest Perl script (Parkinson and Blaxter 2004), based on the Phred base-calling software (Ewing and Green 1998; Ewing et al. 1998). The PartiGene script (Parkinson et al. 2004a) was used to cluster redundant sequences using the CLOBB EST software (Parkinson, Guiliano, and Blaxter 2002) and Phrap (P. Green, personal communication). The manually edited EST sequences were submitted to the National Center for Biotechnology Information EST database (table 1, lineage and accession numbers). Mitochondrial and rRNA transcripts were excluded from GenBank EST submissions, but the full data are available from http://www.bio.ic.ac.uk/research/apvogler/vogler.htm.

Sequence Clustering and Phylogenetic Analysis

EST sequences were subjected to Blast comparisons against GenBank using BlastN (nucleotide–nucleotide searches) and TBlastX (conceptual protein translations) (Altschul et al. 1990). Where significant matches were found (E value <10^−5) and putative gene identity was established by these sequence comparisons, TUGs were assigned gene ontology (GO) classifications by comparing deduced amino acids with the Uniprot database and parsing of the Uniprot GO table (http://www.ebi.ac.uk/uniprot/index.html). Gene classifications were accepted if our data had 30% similarity over >100 amino acids with curated data and a significant E value (<10^−5) in a TBlastX search. When parsed for GO classification, we accepted identity from lower TBlastX matches if top matches did not contain GO classifications. TBlastX searches were used to calculate the proportion of TUGs which matched sets of proteins from D. melanogaster, Homo sapiens, and Caenorhabditis elegans with E values <10^−5.
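
The acceptance rule described above can be sketched as a simple filter. This is an illustrative sketch, not the scripts used in the study: the `Hit` record, its field names, and `first_annotated_hit` are hypothetical, and the E value cutoff is read in the conventional sense that smaller values are more significant (E < 10^-5). Hits are examined from best to worst, so a lower-ranked match supplies the GO classification when the top match carries none.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One TBlastX match for a TUG (hypothetical record layout)."""
    subject_id: str
    percent_similarity: float
    aligned_length_aa: int
    e_value: float
    go_terms: tuple


def first_annotated_hit(hits, min_similarity=30.0, min_length=100, max_e=1e-5):
    """Return the best-scoring hit that passes the thresholds and has GO terms."""
    for hit in sorted(hits, key=lambda h: h.e_value):
        passes = (
            hit.percent_similarity >= min_similarity
            and hit.aligned_length_aa > min_length
            and hit.e_value < max_e
        )
        if passes and hit.go_terms:
            return hit
    return None


# Made-up example: the top hit has no GO terms, the third fails on similarity.
hits = [
    Hit("P1", 45.0, 150, 1e-30, ()),
    Hit("P2", 38.0, 120, 1e-12, ("GO:0003735",)),
    Hit("P3", 25.0, 200, 1e-8, ("GO:0005840",)),
]
chosen = first_annotated_hit(hits)
```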

For clustering, similarity between TUGs within and between libraries was determined using TBlastX searches. For each TUG, its TBlastX hits were examined, and if the similarity was above a specified threshold, then a cluster was made. These first-pass clusters contained many TUGs in more than one cluster, so these clusters were themselves iteratively merged and redundant sequences removed, until there were no sequences contained in more than one cluster. The Python scripts used for clustering are available from PGF on request. TUGs clustered in searches were translated in Sequencher and aligned with ClustalX (Thompson et al. 1997).
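
The iterative merging step can be sketched as follows. This is not the authors' Python code, which is available only on request. It is a minimal illustration that assumes the first-pass clusters are given as sets of TUG identifiers; the identifiers and the function name `merge_overlapping_clusters` are hypothetical. Clusters that share any TUG are merged repeatedly until no TUG remains in more than one cluster.

```python
def merge_overlapping_clusters(first_pass):
    """Merge clusters that share members until all clusters are disjoint."""
    clusters = [set(c) for c in first_pass if c]
    merged_any = True
    while merged_any:
        merged_any = False
        result = []
        for cluster in clusters:
            for existing in result:
                if existing & cluster:      # shared TUG: fold into existing
                    existing |= cluster
                    merged_any = True
                    break
            else:                           # no overlap with any kept cluster
                result.append(set(cluster))
        clusters = result
    return clusters


# Made-up TUG identifiers; "B3" and "C2" each appear in two first-pass clusters.
first_pass = [{"A1", "B3"}, {"C2", "E9"}, {"B3", "C2"}, {"D5"}, {"D5", "F1"}]
final = merge_overlapping_clusters(first_pass)
```

Using sets removes redundant copies of a TUG automatically, and the loop stops after the first pass in which nothing is merged.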

For phylogenetic analysis from these clusters, we focused specifically on the RP genes. After minor sequence editing and verification of transcript fidelity, the most complete amino acid sequences were used for conceptual translations using ClustalX and submitted to European Molecular Biology Laboratory nr databases (Supplementary Material A, see Supplementary Material online). Three further Coleoptera, Tribolium castaneum (J. Savard and D. Tautz, personal communication), Callosobruchus maculatus (J. H. F. Pedra, A. Brandt, R. Westerman, H.-M. Li, J. Romero-Severson, L. L. Murdock, and B. R. Pittendrigh, personal communication), and Ips pini (Eigenheer et al. 2003) with public ESTs in GenBank were also searched for RPs and used in the phylogenetic analysis. After excluding the smallest EST libraries (Silpha atrata and Tribolium confusum), we concatenated data from 66 RPs found in four or more species of Coleoptera, which correspond to minimal phylogenetic clusters (sensu Driskell et al. 2004). Regions of uncertain amino acid alignment homology were removed using Gblocks 0.91b (Castresana 2000). Phylogenetic analysis was conducted with parsimony, with a heuristic search strategy (random taxon addition, 100 replicates; Tree Bisection-Reconnection branch swapping). We used PAUP* to calculate nonparametric bootstrap scores (1,000 replicates) and Bremer support, facilitated by TreeRot 2.0 (Sorenson 1999). Phyml v2.4.4 (Guindon and Gascuel 2003) was used for maximum likelihood (ML) analyses with 100 bootstraps, using both the WAG substitution model, suitable for soluble proteins such as RPs, and the Dayhoff model selected with ModelGenerator (http://bioinf.nuim.ie/software/modelgenerator). With both models, we accounted for the among-site rate variation using a gamma distribution and a proportion of invariant sites (pInvar).
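
The concatenation step can be illustrated with a short sketch. It assumes each gene alignment is a mapping from taxon name to an aligned sequence of equal length, keeps genes present in at least `min_taxa` taxa, and pads absent taxa with a missing-data symbol. The function name, the toy gene names, and the choice to count all taxa rather than only Coleoptera are assumptions made for illustration, not the procedure used in the study.

```python
def build_supermatrix(alignments, min_taxa=4, missing="?"):
    """Concatenate per-gene alignments, padding taxa that lack a gene."""
    kept = {g: aln for g, aln in alignments.items() if len(aln) >= min_taxa}
    taxa = sorted({t for aln in kept.values() for t in aln})
    matrix = {t: "" for t in taxa}
    for gene in sorted(kept):
        aln = kept[gene]
        length = len(next(iter(aln.values())))   # aligned length of this gene
        for taxon in taxa:
            matrix[taxon] += aln.get(taxon, missing * length)
    return matrix, sorted(kept)


# Toy alignments with made-up sequences; "geneB" occurs in too few taxa.
toy = {
    "geneA": {"A": "MK", "B": "MK", "C": "MR", "D": "MK"},
    "geneB": {"A": "GT", "B": "GS"},
    "geneC": {"A": "LLV", "B": "LIV", "C": "LLV", "E": "LLV"},
}
supermatrix, genes_used = build_supermatrix(toy)
```
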
Bayesian analyses were also conducted using the latter model on the concatenated multigene data set with MrBayes v3.1.1 (Huelsenbeck and Ronquist 2001). Nodal support was assessed as posterior probability from two independent runs each with four chains of 1,000,000 generations in the Markov chain Monte Carlo procedure (the first 500,000 generations were discarded as “burn-in”). In an alternative supertree approach, the same amino acid alignments from each RP gene were first used individually for parsimony analysis using branch and bound searches. For each RP, the strict consensus tree was saved to the file, and resolved nodes were recoded as binary state using matrix representation with parsimony coding (Baum 1992; Ragan 1992) with Clann 2.0.1 (Creevey and McInerney 2005).
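
Matrix representation with parsimony coding can be sketched in a few lines. This illustrates the standard Baum and Ragan scheme rather than Clann's internal implementation. Each source tree is assumed to be supplied as its taxon set plus a list of resolved clades; the input layout and the name `mrp_matrix` are hypothetical. Every clade becomes one binary character: 1 for taxa inside the clade, 0 for taxa present in that tree but outside the clade, and ? for taxa absent from the tree.

```python
def mrp_matrix(source_trees):
    """Return {taxon: character string} for a list of (taxa, clades) pairs."""
    all_taxa = sorted(set().union(*(taxa for taxa, _ in source_trees)))
    rows = {taxon: [] for taxon in all_taxa}
    for taxa, clades in source_trees:
        for clade in clades:
            for taxon in all_taxa:
                if taxon in clade:
                    rows[taxon].append("1")   # member of the clade
                elif taxon in taxa:
                    rows[taxon].append("0")   # in the tree, outside the clade
                else:
                    rows[taxon].append("?")   # not sampled for this gene
    return {taxon: "".join(chars) for taxon, chars in rows.items()}


# Two made-up gene trees with partly overlapping taxon sets.
trees = [
    ({"A", "B", "C", "D"}, [{"A", "B"}, {"A", "B", "C"}]),
    ({"A", "C", "E"}, [{"A", "C"}]),
]
coded = mrp_matrix(trees)
```

The resulting matrix can then be analyzed with parsimony like any other binary character matrix.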

Results

Characteristics of the Libraries

Among the EST libraries for 32 insect species, plus two arthropod outgroups (a spider and millipede), we sampled 20 species of Coleoptera, with representatives from each of the four suborders, and a selection of all major groups (Series) in the large suborder Polyphaga. Together, the libraries contained 23,026 EST sequences with high-quality base calls, ranging from 29 to 1,341 ESTs per taxon (table 1). In total, 8,728 TUGs were obtained after semimanual editing (Materials and Methods). Automated editing of the same data produced 8,910 unique sequences, with ∼7% fewer sequences in redundant groups and 12% more singletons (table 1). Overall sequence similarity and statistical analysis (below) produced results similar to those from the manually edited sequences; hence, automated EST clustering appears sufficiently reliable for the initial compilation of large data sets, particularly as sequence quality increases with the number of ESTs in a TUG.
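
The comparison between automated and manual editing can be checked against the totals in table 1. The sketch below assumes that "sequences in redundant groups" corresponds to the contig counts. Because the baseline behind the rounded percentages is not stated, both possible baselines are computed. The variable and function names are illustrative.

```python
# Column totals from table 1.
AUTOMATED = {"contigs": 4524, "singletons": 4386, "sequences": 8910}
MANUAL = {"contigs": 4855, "singletons": 3873, "sequences": 8728}


def percent_of(difference: float, baseline: float) -> float:
    """Express a difference as a percentage of the chosen baseline."""
    return 100.0 * difference / baseline


contig_gap = MANUAL["contigs"] - AUTOMATED["contigs"]
singleton_gap = AUTOMATED["singletons"] - MANUAL["singletons"]

fewer_contigs_vs_manual = percent_of(contig_gap, MANUAL["contigs"])
fewer_contigs_vs_automated = percent_of(contig_gap, AUTOMATED["contigs"])
more_singletons_vs_manual = percent_of(singleton_gap, MANUAL["singletons"])
more_singletons_vs_automated = percent_of(singleton_gap, AUTOMATED["singletons"])

print(f"contigs: {fewer_contigs_vs_manual:.1f}% / {fewer_contigs_vs_automated:.1f}% fewer")
print(f"singletons: {more_singletons_vs_manual:.1f}% / {more_singletons_vs_automated:.1f}% more")
```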

According to the GO categorization of the 34 EST libraries (table 2), the nuclear genes most frequently detected were “housekeeping” genes, including RPs and enzymes. Transcripts from mitochondrial genes were also prevalent, with an average of six mitochondrial transcripts per taxon. Although mitochondrial sequences present in EST libraries are an artifact of the reverse transcriptase–PCR procedure, they provide valuable phylogenetic markers. In contrast, relatively few developmental proteins, transcription factors, and elongation factors (EFs) were detected among ESTs. A large number of ESTs showed significant similarity to genes of unknown function in the Uniprot database (5%–37% depending on the library), and in each taxon, a large proportion of the sequences (35%–80%) did not have any significant public database matches within the search parameters.

Table 2

The 34 EST Libraries Categorized According to GO Terms


 

Columns, in order. GO Classification: Library; Enzyme; RP; Mitochondrial Gene; Transport; Nucleic Acid Binding^a; Chaperone/Heat Shock; Motor; Protein Kinase/Phosphatase; Developmental Protein; Axon/Neurotransmitter; EF; Signal Transduction; Cell Cycle; Proteasome; Translation Initiation Factor; Actin Binding; Transcription Factor; Cell Adhesion; Unknown; No Match^b; Total. Percent Matches^c: To D. melanogaster; To H. sapiens; To C. elegans; To the other libraries; Intralibrary.
 
Micromalthus debilis 12           33 83 150 46 36 31 24 15 
Sphaerius sp. 14 21         75 227 358 45 36 31 18 
Carabus granulatus 18       50 56 161 63 58 54 47 13 
Cicindela campestris 22 13 15       77 190 336 48 42 36 28 
Cicindela litorea            40 151 221 38 37 25 16 
Cicindela littoralis 10          47 128 213 46 42 33 30 16 
Meladema coriacea 25 10        84 131 286 52 49 42 25 14 
Georissus sp. 28 18 10 10       98 211 391 52 45 37 27 15 
Silpha atrata                17 29 35 18 53 
Hister sp. 25 16      84 164 322 58 54 47 25 10 
Scarabaeus laticollis 21 15 12      113 121 301 64 53 48 26 19 
Agriotes lineatus 22 12      83 247 401 45 39 35 23 20 
Julodis onopordi 15 11        100 179 327 46 42 33 19 14 
Eucinetus sp. 13 13 11 14      78 170 317 60 50 41 24 13 
Dascillus cervinus 15 11 10       104 162 325 61 35 23 23 11 
Biphyllus lunatus 14 19      81 97 234 67 58 51 37 15 
Mycetophagus 4-pustulatus 36 11        107 214 401 57 50 42 25 13 
Tribolium confusum             22 31 64 56 50 38 25 
Timarcha balearica 27 21        77 118 267 60 54 49 27 14 
Curculio glandium 17        76 95 221 66 56 39 37 10 
Platystomos albinus 10         35 65 130 47 44 42 25 18 
Euclidia glyphica 18 10      38 165 255 44 34 29 23 17 
Papilio dardanus 18 43 15 10       99 131 334 63 58 52 36 23 
Mengenilla chobauti 20 12    87 187 346 56 48 41 23 
Eoxenos laboulbenei 23 18 10     121 186 389 60 54 45 21 11 
Gryllus bimaculatus 11       37 230 310 32 28 24 16 12 
Sphodromantis centralis             89 111 21 22 14 17 
Aleurothrixus sp. 14 10      52 166 270 45 42 38 21 
Phaeostigma major              10 31 55 47 40 36 38 
Limnephilus flavicornis              13 87 120 29 25 22 19 11 
Panorpa cf. vulgaris 17        88 184 318 54 45 40 18 12 
Lepisma aurea 15 25 10   81 160 340 57 53 46 29 13 
Dysdera erythrina 13 10 15     45 148 257 45 44 39 14 22 
Julida sp. 14         27 140 210 36 38 31 10 14 
Total 502 401 220 216 109 41 56 53 43 40 35 29 25 17 17 11 12 2,147 4,753 8,737      
Average 15 12 6 7 4 2 2 2 2 2 2 2 2 2 1 1 2 1 64 140 258 50 44 37 26 38
 

 

 
a Includes RNA processing and small nuclear ribonucleoprotein complex.

b No Blast hits found within the selected parameters of >30% similarity, >100 amino acids, and E value lower than 10^-5.

c The percentage of sequences in a given library with matches to complete databases (known and predicted proteins) of Drosophila melanogaster, Homo sapiens, and Caenorhabditis elegans (BlastX, E value < 10^-5); the nucleotide sequences of other libraries in this study (BlastN, E value < 10^-5); and translated sequences within the library (TBlastX, E value < 10^-5).


When our ESTs were compared against the genes of D. melanogaster, 50% of sequences had significant matches with E values < 10^-5 (ranging from 21% to 67% depending on probe species; table 2). Overall, our insect ESTs had significantly more matches with D. melanogaster sequences than with H. sapiens or C. elegans (df = 33, t = 6.5 and 5.6, respectively, P < 0.001). The insect ESTs also showed fewer matches with C. elegans than with H. sapiens (df = 33, t = 8.7, P < 0.001). This result runs counter to the presumed closer relationship of nematodes with insects based on rRNA (Aguinaldo et al. 1997), protein-encoding genes (Philippe, Lartillot, and Brinkmann 2005), and genome-scale evidence (H. Dopazo and J. Dopazo 2005), but it agrees with other analyses of genomic sequences and ESTs (Blair et al. 2002; Theodorides et al. 2002; Hedges et al. 2004; Philip, Creevey, and McInerney 2005). It is now increasingly well established that this apparent affinity of insects with humans is an artifact of poor taxon sampling (Philippe, Lartillot, and Brinkmann 2005; Telford and Copley 2005). We present here the distribution of matches of each organism to the complete genomes of D. melanogaster, H. sapiens, and C. elegans as Venn diagrams using SimiTri (Parkinson and Blaxter 2003). Compared to ESTs of Coleoptera, levels of sequence similarity between nonholometabolan insect species and D. melanogaster were somewhat reduced (table 2 and SimiTri graphics at http://darwin.zoology.gla.ac.uk/~jhughes/SimiTri/), as expected with decreased phylogenetic proximity.
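The comparisons above report df = 33 for 34 libraries, which is what a paired test across libraries would give. The test itself is not named in this section, so the following is only a sketch under that assumption. It computes a paired t statistic from the percent-match columns of four rows of table 2; the function name `paired_t` is illustrative, and the four-row subset is far too small to reproduce the reported values.

```python
from math import sqrt
from statistics import mean, stdev

# (library, % matches to D. melanogaster, % matches to H. sapiens),
# copied from four rows of table 2; an illustrative subset only.
rows = [
    ("Micromalthus debillis", 46, 36),
    ("Carabus granulatus", 63, 58),
    ("Meladema coriacea", 52, 49),
    ("Georissus sp.", 52, 45),
]


def paired_t(pairs):
    """Return (t, df) for a paired t-test on (a, b) observations.

    Assumption: each library contributes one pair, so df = n - 1.
    """
    diffs = [a - b for a, b in pairs]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


t_stat, df = paired_t([(d, h) for _, d, h in rows])
print(f"t = {t_stat:.2f}, df = {df}")
```

Applied to all 34 libraries, the same calculation yields df = 33, matching the degrees of freedom quoted in the text.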

Clustering Between Libraries

The presence of putative orthologs across libraries is critical for EST data to be useful in molecular systematics. Using the BlastN algorithm, we found that between 10% and 53% of unique sequences in a given library had matches (E value < 10^-5) with the data set containing all the other libraries (table 2). After conceptual translation, pairwise sequence matches (TBlastX, E value < 10^-5) ranged from 1% to 29% of unique sequences shared between any two libraries, with an average of 12% (Supplementary Material B, see Supplementary Material online). The number of intralibrary matches was slightly lower, with 0% to 23% of sequences showing significant matches within the same library in a protein-level search (table 2), which nonetheless indicates a high proportion of paralogy in each library. Manual editing of primary sequences increased the between-library matches and the size of clusters at stringent cutoff values when compared to the automated approach (10^-80: t = 2.2, df = 34, P < 0.05; 10^-100: t = 2.4, df = 34, P < 0.05; 10^-150: t = 2.1, df = 34, P < 0.05; Supplementary Material C, see Supplementary Material online).

When sequences with significant similarity were clustered across all libraries, up to 731 clusters included TUGs from two or more taxa, although no TUG had representatives in more than 28 of the 34 libraries. A total of 154 TUGs showed significant Blast matches within a single taxon only (Supplementary Material C, see Supplementary Material online). Most of the largest clusters, with TUGs in more than eight species at an E value < 10^-10 (table 3; Supplementary Material D, see Supplementary Material online), included genes for which exceptional levels of mRNA expression have been established (Hsiao et al. 2001). The largest clusters included RPs and mitochondrial genes. Sequences from known protein families were also present in the clusters, such as tubulins, myosins, and troponin I. Three clusters contained EF genes (EF-1 alpha homologs, EF-1 beta, and EF-2). Interestingly, four clusters of genes did not have any Blast matches, and six clusters showed matches only to D. melanogaster and A. gambiae genes of unknown function (Supplementary Material D, see Supplementary Material online). Along with mitochondrial genes, several nuclear genes detected in multiple EST libraries have been used widely in insect molecular systematic studies (Caterino, Cho, and Sperling 2000). These included 28S rRNA (represented in EST libraries of 14 species), EF-1 alpha (13 species), H3 histone (7 species), and Cu, Zn-superoxide dismutase (7 species).
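To illustrate how clusters of this kind can be formed from pairwise Blast hits, here is a minimal sketch that assumes single-linkage grouping at a chosen E-value cutoff. The clustering procedure used in the study is not described in this section, so the linkage rule, the function name `cluster_hits`, and the sequence labels and E values below are all hypothetical.

```python
def cluster_hits(hits, cutoff):
    """Single-linkage clustering of sequences from pairwise hits.

    hits   : iterable of (seq_a, seq_b, e_value) tuples
    cutoff : two sequences are linked when e_value < cutoff
    Returns a list of clusters, each a sorted list of sequence names.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, e_value in hits:
        root_a, root_b = find(a), find(b)  # registers both sequences
        if e_value < cutoff and root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for seq in list(parent):
        groups.setdefault(find(seq), []).append(seq)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical hits: two paralogs (tubA, tubB) in each of two libraries.
hits = [
    ("Carabus_tubA", "Hister_tubA", 1e-90),
    ("Carabus_tubB", "Hister_tubB", 1e-95),
    ("Carabus_tubA", "Carabus_tubB", 1e-30),  # weaker paralog-to-paralog hit
]

print(cluster_hits(hits, 1e-10))  # one cluster containing all four sequences
print(cluster_hits(hits, 1e-80))  # two clusters, one per paralog
```

In this toy case the weaker hit between paralogs joins everything at a permissive cutoff and is ignored at a stringent one, so the cluster splits.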

Table 3

Top TUG Clusters: Largest Clusters (number of cDNA libraries and number of sequences represented in the cluster) at Various TBlastX E Value Cutoffs


 

Top Clusters; TBlastX cutoff E values: 10^-10, 10^-15, 10^-20, 10^-40, 10^-50, 10^-60, 10^-80 (two columns per cutoff: number of cDNA libraries, number of sequences)
16S Ribosomal RNA gene 28 113 28 109 28 100 26 50 24 47 10 
Cytochrome oxidase subunit I 27 57 27 55 27 55 26 52 26 49 25 48 24 40 
Troponin/myosin family 25 85 25 83 25 81 16 24 15 18 13 16 10 11 
Cytochrome c oxidase subunit III 24 46 24 45 24 44 24 30 24 29 24 28 18 22 
Cytochrome oxidase subunit II 23 31 23 30 23 29 22 28 20 26 19 24 17 20 
Cytochrome b 23 26 23 26 23 26 22 23 22 23 21 22 20 21 
Chymotrypsin family 20 69 17 44 16 38 11 14   
Adenosine triphosphatase 6 20 25 20 25 20 24 15 19 15 17 10 11   
Actin 19 39 19 38 19 37 19 35 19 33 19 33 18 30 
Ubiquitin family 18 31 18 31 18 31 17 27 10 10 
Chemosensory protein 18 24 17 22 16 20     
Troponin I family 17 26 17 26 17 26 14 23 13 21 10 18 
Cathepsin family 15 34 12 19 11 16 10 
NADH dehydrogenase subunit 2 15 19 14 18 13 16     
NADH dehydrogenase subunit 4 15 17 15 16 15 15 14 14 13 13 11 11 
Tubulin family 14 34 13 31 13 30 11 24 
RAS oncogene family 14 26 11 17 13 11 
Heat shock protein family 14 23 10 18 10 18 12 
Disulfide isomerase/thioredoxin family 14 22 12 15     
28S Ribosomal RNA gene/ gamma-aminobutyric acid A receptor–associated protein 14 21 14 20 11 10   
Ferritin 1 family 14 18 14 18 14 17 12 14 11 13 10 
NADH dehydrogenase subunit 1 14 18 14 18 13 16 11 12 
MP20/CalPoNin 14 17 13 16 13 15 11 12 10 11 7 8 6 6
 

 

 

NOTE.—NADH, reduced form of nicotinamide adenine dinucleotide.


The number and size of clusters depended strongly on the significance level of the Blast search, partly because paralogs separate at higher stringency. This was evident in tubulins (breaking up into alpha and beta superfamilies at higher stringency), myosins (separating into light chain I and regulatory light chain II), and troponin I (separating into troponin I a1 and troponin I b1). Table 4 presents those clusters in which the number of unique sequences equals the number of taxa (libraries), i.e., where each taxon contributes only one ortholog. Such potentially paralogy-free clusters included a maximum of 14 taxa. Many of these were identified as RP genes and were used to test the phylogenetic utility of the EST database.
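The selection criterion for table 4 (as many unique sequences as taxa) can be expressed as a simple filter. The sketch below assumes each cluster is available as a list of (taxon, sequence identifier) pairs; the function name `paralogy_free`, the cluster names, and the sequence identifiers are hypothetical.

```python
from collections import Counter


def paralogy_free(clusters, min_taxa=2):
    """Keep clusters in which every taxon contributes exactly one sequence.

    clusters : dict mapping cluster name -> list of (taxon, seq_id) pairs,
               where each seq_id is a unique sequence
    min_taxa : smallest number of taxa for a cluster to be reported
    Returns a dict mapping cluster name -> sorted list of taxa.
    """
    kept = {}
    for name, members in clusters.items():
        per_taxon = Counter(taxon for taxon, _ in members)
        n_taxa = len(per_taxon)
        # Sequences == taxa holds only when each taxon appears exactly once.
        if n_taxa >= min_taxa and len(members) == n_taxa:
            kept[name] = sorted(per_taxon)
    return kept


# Hypothetical clusters for illustration.
clusters = {
    "RP_S8": [("Carabus", "c1"), ("Hister", "h1"), ("Tribolium", "t1")],
    "tubulin": [("Carabus", "c2"), ("Carabus", "c3"), ("Hister", "h2")],
}

print(paralogy_free(clusters))
```

A cluster that passes this filter is only potentially free of paralogs, because a small library may simply have missed a second expressed copy.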

Table 4

Top TUG Clusters: Potentially Orthologous TUG Sequences


 

Top Orthologous TUG Clusters; cutoff E values: 10^-10, 10^-15, 10^-20, 10^-40, 10^-50, 10^-60, 10^-80 (two columns per cutoff: number of libraries, number of unique sequences)
Translationally controlled tumor protein 14 14 14 14 12 12 
No BlastX ID, no BlastN ID 13 13 11 11         
40S RP S8 13 13 13 13 13 13 11 11 11 11 10 10 
60S RP L24 12 12   
40S RP S30 12 12 11 11 11 11     
60S RP L27 12 12 12 12 12 12 12 12 10 10 10 10 
Cytochrome c oxidase subunit Vb 12 12 12 12 11 11 
CG32230-PA [Drosophila melanogaster] 11 11 11 11 10 10
CG4692-PB [D. melanogaster] 11 11 11 11 11 11
40S RP S23 11 11 11 11 11 11 11 11 11 11 10 10 
F1F0-type ATP synthase subunit g/CG6105-PA 10 10 10 10 10 10     
40S RP S18 10 10 10 10 10 10 
Peroxiredoxin V protein 
60S RP L11 
60S RP L27A 
60S RP L6 
60S RP L8 
40S RP S17 
60S RP L19 
60S RP L15 
12S Ribosomal RNA gene       
40S RP S10 
60S RP L36     
Dynein light chain 2   
40S RP S19 
Ribosome-associated membrane protein RAMP4         
Nitrogen fixation clusterlike   
40S RP S20   
Vacuolar ATP synthase subunit G 8 8 8 8 8 8 7 7
 

 

 

 

 

 

 

 

NOTE.—ATP, adenosine triphosphate.


The Higher Coleopteran (Beetle) Phylogeny from RPs

Out of a complete set of 76 nonacidic RPs found in insects (Landais et al. 2003), our typical EST libraries (200–500 double-strand sequenced ESTs) recovered between 10 and 30 RPs, with notably fewer detected in smaller libraries (fig. 1). We also included data from existing larger EST resources for Tribolium (1,825 ESTs) and Ips (1,671 ESTs), which yielded a much higher proportion of RP genes and higher transcript redundancy (e.g., T. castaneum had a mean of 5 ± 3 [standard deviation] ESTs per RP). In each taxon, redundant sequences easily grouped together to generate a single transcript for each RP. The ease of grouping redundant transcripts increased the confidence that most, if not all, RPs have a single expressed copy in Coleoptera and hence that phylogenetic analyses were conducted across orthologous sequences.

FIG. 1.—

Correlation between the total number of ESTs sequenced and the number of genes from the 76 (nonacidic) nuclear RPs in Coleoptera.


Overall, these data suggest that the number of detected RP genes increases linearly with the number of ESTs (fig. 1, R^2 = 0.7006; y = 0.0278x) and further predict that libraries of ∼2,000 ESTs obtained from whole adult specimens can yield complete sets of RPs. The linear increase is consistent with the fact that different RP genes were recovered in different organisms (fig. 2), even if a similar total number of RPs was detected. This might indicate that most RP genes have a similar chance of being cloned in our relatively small libraries, but that considerably more than a few hundred ESTs must be sequenced to obtain the complete set of 76 RPs.
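As a worked illustration of the fitted relationship, the sketch below evaluates the reported line y = 0.0278x at a few library sizes and inverts it for the full set of 76 RPs. The function names are illustrative, and capping the prediction at 76 is an assumption added here, since detection cannot exceed the number of RP genes.

```python
SLOPE = 0.0278   # RPs detected per EST, from the fit reported with fig. 1
TOTAL_RPS = 76   # complete set of nonacidic RPs in insects


def predicted_rps(n_ests):
    """Predicted number of RP genes detected in a library of n_ests ESTs."""
    return min(SLOPE * n_ests, TOTAL_RPS)


def ests_needed(n_rps):
    """Library size at which the fitted line reaches n_rps RP genes."""
    return n_rps / SLOPE


for size in (200, 500, 2000):
    print(size, round(predicted_rps(size), 1))

print(round(ests_needed(TOTAL_RPS)))
```

Read strictly, the line gives about 56 RPs at 2,000 ESTs and reaches 76 RPs near 2,700 ESTs, so the estimate of ∼2,000 ESTs is best treated as approximate.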

FIG. 2.—

Detection of the complete set of 76 nuclear RP genes in beetles (Series Cucujiformia). Genes are arranged along the x axis according to the standard RP nomenclature. Legend refers to coleopteran genus (number of RPs detected: number of N-terminus (5′/forward pass) ESTs sequenced with high base-call quality).


Phylogenetic analysis was conducted to establish basal relationships in the Coleoptera with 66 RP genes using both a “supertree” approach (derived from the topology of individual gene trees) and a “supermatrix” approach (derived by simultaneous phylogenetic analysis of all sequence information). Ten additional RP genes were detected in fewer than 4 of the 20 Coleoptera species and could not be used for phylogenetic analysis. After removing these sequences and the alignment-sensitive regions from all other genes (Materials and Methods), the final data matrix included a total of 10,403 amino acid residues, with individual taxa represented by 447 (Platystomos) to 9,151 (Tribolium) residues, an average of 2,976 ± 1,892 residues, and an overall degree of matrix completion of 28.6%. Individual genes were represented in between 4 and 10 taxa. When all 76 RPs are considered, the mean number of taxa per gene was 5.68.

All methods of tree construction (maximum parsimony, ML, Bayesian, and supertree) produced similar tree topologies (fig. 3). At the deepest nodes, when rooted with the suborder Archostemata, the remaining coleopteran suborders resolved as (Adephaga (Myxophaga, Polyphaga)), although the supertree analyses placed Myxophaga (Sphaerius sp.) within the Polyphaga. In all analyses, the Elateriformia (one of the five Series of families of Polyphaga) was a paraphyletic assemblage of basal Polyphaga, with the Eucinetidae (Eucinetus sp.) sister to the remaining Series, Staphyliniformia, Scarabaeiformia, and Cucujiformia. The close relationship of Scarabaeus laticollis (Scarabaeiformia), Georissus sp., and Hister sp. (Staphyliniformia) supported the Haplogastra uniting both Series (Crowson 1955) but rendered the Staphyliniformia paraphyletic, in accordance with recent findings (Korte et al. 2004; Caterino, Hunt, and Vogler 2005). The monophyly of Cucujiformia, a group of derived polyphagan beetles containing about half of all beetle species, was recovered, and the well-established superfamilies Tenebrionoidea, Chrysomeloidea, and Curculionoidea were each monophyletic, with Biphyllus lunatus (Biphyllidae) placed at the base, as expected. The supertree approach yielded generally less resolution and misplaced Platystomos (the smallest library) outside of the Phytophaga, as did the Bayesian analysis.
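The reported degree of matrix completion follows directly from the residue counts given above: the filled fraction of a taxon-by-site matrix equals the mean number of residues per taxon divided by the alignment length. A minimal check using only figures quoted in the text (the function name is illustrative):

```python
ALIGNMENT_LENGTH = 10_403  # amino acid positions in the final data matrix


def completeness(n_residues, alignment_length=ALIGNMENT_LENGTH):
    """Fraction of alignment positions filled by n_residues residues."""
    return n_residues / alignment_length


# Residue counts quoted in the text.
overall = completeness(2_976)     # average residues per taxon
least = completeness(447)         # Platystomos
most = completeness(9_151)        # Tribolium

print(f"overall {overall:.1%}, least {least:.1%}, most {most:.1%}")
```

The overall value reproduces the 28.6% stated in the text, while individual taxa range from roughly 4% to 88% of the alignment.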

FIG. 3.—

Coleopteran phylogeny inferred from 66 RPs. (A) Bayesian phylogeny with posterior probabilities above the nodes and bootstrap support (>65%) from the ML analyses below the nodes. Italicized bootstrap values were obtained using the WAG substitution model and bold values using the Dayhoff model with 100 pseudoreplicate searches. The node marked * refers to the monophyletic Cucujiformia with the exception of Platystomos albinus. Inset shows the bootstrap support >65% using the Dayhoff model at basal nodes of the Coleoptera. (B) Left: Majority rule matrix representation with parsimony supertree using Baum and Ragan coding from parsimony analysis. Above nodes = Bremer support (proportion of equally parsimonious trees containing node if <100). Right: Majority rule solution using concatenated gene supermatrix from parsimony analysis. Above nodes = Bremer support (percent of equally parsimonious trees containing node if <100). Below nodes = bootstrap, based on 1,000 pseudoreplicates.
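The Baum and Ragan coding mentioned for the supertree in panel B turns each clade of each gene tree into one binary character: taxa inside the clade score 1, taxa present in that gene tree but outside the clade score 0, and taxa missing from that gene tree score "?". The sketch below shows this coding on two toy gene trees. The trees, the function name `mrp_matrix`, and the input format are hypothetical, and the all-zero outgroup row usually added for rooting is omitted.

```python
def mrp_matrix(source_trees):
    """Baum and Ragan matrix representation of a set of source trees.

    source_trees : list of (taxa, clades) pairs, where taxa is the set of
                   taxa in one gene tree and clades is a list of sets, one
                   per internal clade of that tree
    Returns a dict mapping taxon -> string of '1', '0', and '?' characters.
    """
    all_taxa = sorted(set().union(*(taxa for taxa, _ in source_trees)))
    rows = {taxon: [] for taxon in all_taxa}
    for taxa, clades in source_trees:
        for clade in clades:
            for taxon in all_taxa:
                if taxon in clade:
                    rows[taxon].append("1")   # member of the clade
                elif taxon in taxa:
                    rows[taxon].append("0")   # in the tree, outside the clade
                else:
                    rows[taxon].append("?")   # not sampled for this gene
    return {taxon: "".join(chars) for taxon, chars in rows.items()}


# Two hypothetical gene trees with partly overlapping taxon sets.
tree_1 = (
    {"Carabus", "Tribolium", "Timarcha", "Curculio"},
    [{"Timarcha", "Curculio"}, {"Tribolium", "Timarcha", "Curculio"}],
)
tree_2 = ({"Hister", "Tribolium", "Curculio"}, [{"Tribolium", "Curculio"}])

for taxon, row in mrp_matrix([tree_1, tree_2]).items():
    print(f"{taxon:10s} {row}")
```

The resulting matrix is then analyzed by parsimony, which is why gene trees with few shared taxa contribute many "?" entries and can reduce resolution.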


Discussion

EST databases are rapidly growing, with approximately 27.6 million entries in GenBank as of June 2005 (http://www.ncbi.nlm.nih.gov/dbEST/). Yet, until recently, the taxonomic coverage of the Class Insecta has been limited to 8 of the 25 or so insect orders. Within the largest order, Coleoptera, three libraries have become available recently, but taxonomically, these represent only a very limited group within one of the Series of Polyphaga. (Two further libraries were added to dbEST since our analysis was conducted.) EST representation in the insects has been severely biased toward Diptera, comprising 15 of 47 holometabolan insects as of June 2005 and ∼628,300 out of ∼919,200 EST sequences (excluding our data). Although the EST data sets presented here are small in comparison with other arthropod EST projects, we have almost doubled the taxonomic coverage of arthropod orders, including the first EST libraries for Strepsiptera, Rhaphidiodea, Trichoptera, Mecoptera, and Thysanura, and added over 11,000 ESTs from the Coleoptera, arguably the most diverse insect order, sampled from the broadest possible taxonomic diversity.

Our main aim was to test whether generating a small number of ESTs from a broad sample of taxa would be a suitable approach to phylogeny reconstruction. The findings confirm that even small libraries (<1,000 clones) show high levels of matching TUGs. Even with an average library size of 257 unique sequences, we recovered a conserved core of genes represented consistently across libraries. Many of these genes had not previously been used for phylogeny reconstruction, increasing the spectrum of molecular markers available to insect systematics. The most widely detected clusters contained mitochondrial DNA transcripts, enzymes, and RPs. However, tree construction was impeded by the great proportion of missing data entries, in particular due to several of the smaller libraries in our data set. Based on the completeness of RP representation in the libraries (fig. 1), we extrapolate that approximately 2,000 ESTs are needed to recover these highly expressed genes consistently when extracting total RNA from a whole adult specimen. Using tissues with a high rate of biosynthesis, for example, embryonic tissues, may increase the proportion of RPs in the libraries and lower the number of ESTs needed to generate the complete set of RPs in each taxon.

Such a large number of sequences may appear to be a costly way to establish phylogenetic relationships between taxa. However, the success of sequencing multiple single-copy loci to resolve the deeper nodes within the Tree of Life (e.g., in mammals: Murphy et al. 2001; Teeling et al. 2005) cannot easily be extended to most groups via traditional PCR methods using degenerate primers. Our efforts to amplify even a few nonstandard single-copy genes consistently within or across different superfamilies of the Coleoptera have largely failed (unpublished data), and the best results to date were obtained when the primers have been based on the EST sequences obtained here (Pons et al. 2004). As automation advances and the cost of sequencing decreases, dense EST sampling is likely to become a more cost-effective approach for acquiring single-copy nuclear markers for the deep-level molecular systematics of many groups.

A perhaps unexpected finding was the high degree of paralogy in most clusters evident from the large number of within-library similarity hits. Paralogs can prohibit the determination of species relationships and mislead phylogenetic inferences if they are not detected. However, tentative orthologous clusters (i.e., with only a single member per taxon) were readily detected and included up to 14 of the 34 taxa (some of which were present in very small libraries). In future, some of these genes may prove not to be paralogy free, but it is reassuring that they include a number of housekeeping genes, such as RPs, which are already known to be largely paralogy free across Metazoa (Landais et al. 2003; Philippe et al. 2004). Other large clusters that were paralogy free under high clustering stringency only (table 3) will require further analyses to separate different paralogy groups.
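The screening criterion described above (a cluster is tentatively orthologous when no taxon contributes more than one sequence) can be sketched in a few lines. This is a minimal illustration rather than the pipeline used in this study; the cluster representation as (taxon, sequence id) pairs, the function names, the `min_taxa` threshold, and the toy clusters are all hypothetical.

```python
from collections import Counter


def taxon_counts(cluster):
    """Count how many sequences each taxon contributes to one cluster."""
    return Counter(taxon for taxon, _seq_id in cluster)


def is_tentative_ortholog_cluster(cluster, min_taxa=2):
    """True if the cluster spans at least min_taxa taxa with one member each."""
    counts = taxon_counts(cluster)
    return len(counts) >= min_taxa and all(n == 1 for n in counts.values())


def split_clusters(clusters, min_taxa=2):
    """Separate tentative ortholog clusters from all others."""
    accepted, rejected = {}, {}
    for name, members in clusters.items():
        if is_tentative_ortholog_cluster(members, min_taxa):
            accepted[name] = members
        else:
            rejected[name] = members
    return accepted, rejected


# Hypothetical clusters: names and sequence ids are placeholders.
toy_clusters = {
    "cluster_rp": [("taxonA", "est001"), ("taxonB", "est104"), ("taxonC", "est220")],
    "cluster_actin": [("taxonA", "est002"), ("taxonA", "est003"), ("taxonB", "est105")],
}
accepted, rejected = split_clusters(toy_clusters)
```

A cluster that passes this test is only tentatively orthologous, because a paralog that was not expressed or not sampled in a small library leaves no trace in the counts.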

For molecular systematics, EST sequencing exposes us to hundreds of loci for which we have no existing information about the pattern of molecular variation and phylogenetic information content. At this early stage of comparative EST sequencing, it already seems obvious that only a minority of the available genes will emerge as useful for reconstructing phylogenetic relationships at the deeper hierarchical levels, whereas most gene sequences will be shown to suffer from shallow paralogy possibly linked to functional diversity. As EST sequences tend to be short, well-supported phylogenetic trees will only emerge when several genes of overlapping resolution are combined, together enhancing the phylogenetic signal (Olmstead and Sweere 1994; Gatesy et al. 1999). However, simultaneous analysis is only justified once orthology has been established.

Clearly, the RP genes provide such a resource and were used here to provide valuable insights into the phylogeny of Coleoptera (fig. 3). The relationships among the four suborders of Coleoptera have long been controversial (Hennig 1981; Lawrence and Newton 1995; Beutel and Haas 2000), with each of the three possible arrangements supported by reputable studies (Kukalova-Peck and Lawrence 1993; Beutel and Haas 2000; Caterino et al. 2002). The supermatrix analysis based on 66 RPs suggests the placement of Myxophaga as sister to Polyphaga, which is consistent with the traditional view, going back to Crowson (1955, 1960), and with several later studies based on various morphological character systems. These results conflict with those from 18S rRNA, which place Adephaga, not Myxophaga, as the sister to Polyphaga (Caterino et al. 2002), but phylogenetic conclusions from this gene are affected by length variation and rate heterogeneity, and hence, independent evidence from RPs is very valuable. Within the Polyphaga, the EST data supported the general ideas about basal relationships of the Series (the five traditional family groups of Polyphaga), including the paraphyly of Staphyliniformia with respect to Scarabaeiformia (Korte et al. 2004; Caterino et al. 2005), the paraphyly of Elateriformia and their basal position within Polyphaga (Caterino et al. 2002), and the monophyly of the large Cucujiformia and the large phytophagous Chrysomeloidea and Curculionoidea (“Phytophaga”).

In conclusion, we used dense EST sampling for molecular systematics, to avoid difficult PCR-based methods and extend the range of gene markers for multigene phylogenetics. Comparable studies in nematodes (Parkinson et al. 2004b) and Apicomplexa (Li et al. 2003) focused on gene discovery and comparative genomics, and it will be interesting to use these much larger EST data for phylogenetic analysis in the way proposed here. It is evident from our analysis that phylogenetic inferences will suffer from the unexpectedly high level of paralogy affecting most of the highly expressed loci, unless paralogy groups whose origin precedes the separation of the focal taxa can be separated a priori (Philippe, Lartillot, and Brinkmann 2005; Rodriguez-Ezpeleta et al. 2005).

Many questions remain for the use of the broad EST approach, for example, which molecular techniques are most suitable for enriching the desired loci prior to sequencing or the utility of tissue-specific libraries to reduce the recovery of paralogous sequences. For example, libraries of P. dardanus were obtained from wing discs and included a much higher proportion of RPs than most of the other libraries, which were obtained from total adult tissue (table 2). Furthermore, for comparative studies, bidirectional sequencing of ESTs and careful curation of redundant sequences are important to mitigate problems otherwise introduced by sequencing errors and partial gene sequences. However, the best strategy might be to sequence the majority of ESTs in a single direction and only sequence the reverse direction when the full length of specific genes is missing.

RP genes apparently were little affected by recent paralogy and provide a formidable resource for deep-level phylogenetics. With 66 genes included here in an analysis of Coleoptera, this represents a great advance over the existing trees from single genes (Howland and Hewitt 1995; Caterino et al. 2002). However, as the matrix includes some 71.4% missing data, support levels inevitably will be low (Wiens 2003; Hughes and Vogler 2004; Philippe et al. 2004), even if the effect may be less pronounced with a greater number of genes (Driskell et al. 2004). Yet, presenting just under 9,000 unique nuclear sequences, the current study provides a foundation for multilocus phylogenetics of Coleoptera and other insect groups. Dense taxonomic EST sampling will offer us new opportunities for phylogenetic analysis while also providing a less myopic glimpse of the functional and evolutionary diversity in the most species-rich lineage on Earth.
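How a missing-data proportion of this kind arises in a concatenated matrix can be shown with a small sketch. It is an illustration under stated assumptions, not the procedure used to obtain the 71.4% figure: the dictionary-based alignments, the function names, the choice to count both '?' and '-' as missing, and the toy genes are all hypothetical.

```python
MISSING = frozenset("?-")  # assumption: gaps and unknowns both count as missing


def concatenate(gene_alignments, taxa):
    """Join per-gene alignments; taxa lacking a gene are padded with '?'."""
    rows = {taxon: [] for taxon in taxa}
    for alignment in gene_alignments:
        width = len(next(iter(alignment.values())))
        for taxon in taxa:
            rows[taxon].append(alignment.get(taxon, "?" * width))
    return {taxon: "".join(parts) for taxon, parts in rows.items()}


def missing_fraction(supermatrix):
    """Fraction of cells in the supermatrix that hold a missing symbol."""
    lengths = {len(seq) for seq in supermatrix.values()}
    if len(lengths) != 1:
        raise ValueError("all rows must have the same aligned length")
    total = sum(len(seq) for seq in supermatrix.values())
    missing = sum(ch in MISSING for seq in supermatrix.values() for ch in seq)
    return missing / total


# Two hypothetical genes, each sampled for only two of three taxa.
toy_genes = [
    {"taxonA": "MKV", "taxonB": "MKL"},
    {"taxonA": "GG", "taxonC": "GA"},
]
toy_supermatrix = concatenate(toy_genes, ["taxonA", "taxonB", "taxonC"])
toy_missing = missing_fraction(toy_supermatrix)
```

In this toy case 5 of the 15 cells are padding, so one third of the matrix is missing even though every sampled sequence is complete; small libraries that lack many genes inflate the proportion in the same way.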

1 Present address: Division of Environmental and Evolutionary Biology, Institute of Biomedical and Life Sciences, Graham Kerr Building, University of Glasgow, Glasgow, United Kingdom
2 These authors contributed equally to the work.
Herve Philippe, Associate Editor

We are grateful to Sue Lomas and Francis Wright at the sequencing facilities at Silwood Park and Derek Huntley, James Abbott, and Gail Bartlett from the Bioinformatics Support Service at Imperial College. We thank Hans Pohl, Ignacio Ribera, Michael Balke, and Peter Hammond for contributing insect specimens. We greatly thank Herve Philippe and anonymous reviewers for useful comments, and Miquel Arnedo, Alexandra Cieslak, Jose Galián, Jesus Gómez-Zurita, Fatos Kopliku, and Nathalie Tristem for contributing additional library construction and sequencing. This project was funded by Biotechnology and Biological Sciences Research Council grant 49/G14548 to Michael Caterino, A.P.V. and P.G.F. and a Ph.D. studentship to S.J.L. Additional funding was from the Department of Trade and Industry, United Kingdom and an Alexander S. Onassis Foundation scholarship to A.P.

References

Aguinaldo, A. M., J. M. Turbeville, L. S. Linford, M. C. Rivera, J. R. Garey, R. A. Raff, and J. A. Lake. 1997. Evidence for a clade of nematodes, arthropods and other moulting animals. Nature 387:489–493.
Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403–410.
Bapteste, E., H. Brinkmann, J. A. Lee et al. (11 co-authors). 2002. The analysis of 100 genes supports the grouping of three highly divergent amoebae: Dictyostelium, Entamoeba, and Mastigamoeba. Proc. Natl. Acad. Sci. USA 99:1414–1419.
Baum, B. R. 1992. Combining trees as a way of combining data sets for phylogenetic inference, and the desirability of combining gene trees. Taxon 41:3–10.
Beutel, R. G., and F. Haas. 2000. Phylogenetic relationships of the suborders of Coleoptera (Insecta). Cladistics 16:103–141.
Blair, J. E., K. Ikeo, T. Gojobori, and S. B. Hedges. 2002. The evolutionary position of nematodes. BMC Evol. Biol. 2:7.
Castresana, J. 2000. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol. Biol. Evol. 17:540–552.
Caterino, M. S., S. Cho, and F. A. Sperling. 2000. The current state of insect molecular systematics: a thriving Tower of Babel. Annu. Rev. Entomol. 45:1–54.
Caterino, M. S., T. Hunt, and A. P. Vogler. 2005. On the constitution and phylogeny of Staphyliniformia (Insecta: Coleoptera). Mol. Phylogenet. Evol. 34:655–672.
Caterino, M. S., V. L. Shull, P. M. Hammond, and A. P. Vogler. 2002. Basal relationships of Coleoptera inferred from 18S rDNA sequences. Zool. Scr. 31:41–49.
Creevey, C. J., and J. O. McInerney. 2005. Clann: investigating phylogenetic information through supertree analyses. Bioinformatics 21:390–392.
Crowson, R. A. 1955. The natural classification of the families of the Coleoptera. Nathaniel Lloyd, London.
Crowson, R. A. 1960. The phylogeny of Coleoptera. Annu. Rev. Entomol. 5:111–134.
Dopazo, H., and J. Dopazo. 2005. Genome-scale evidence of the nematode-arthropod clade. Genome Biol. 6:R41.
Driskell, A. C., C. Ane, J. G. Burleigh, M. M. McMahon, B. C. O'Meara, and M. J. Sanderson. 2004. Prospects for building the tree of life from large sequence databases. Science 306:1172–1174.
Eigenheer, A. L., C. I. Keeling, S. Young, and C. Tittiger. 2003. Comparison of gene representation in midguts from two phytophagous insects, Bombyx mori and Ips pini, using expressed sequence tags. Gene 316:127–136.
Erwin, T. L. 1982. Tropical forests: their richness in Coleoptera and other arthropod species. Coleopt. Bull. 36:74–75.
Ewing, B., and P. Green. 1998. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8:186–194.
Ewing, B., L. Hillier, M. C. Wendl, and P. Green. 1998. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8:175–185.
Gatesy, J., M. Milinkovitch, V. Waddell, and M. Stanhope. 1999. Stability of cladistic relationships between Cetacea and higher-level artiodactyl taxa. Syst. Biol. 48:6–20.
Guindon, S., and O. Gascuel. 2003. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52:696–704.
Hammond, P. M. 1992. Species inventory. Pp. 17–39 in B. Groombridge, ed. Global biodiversity, status of the Earth's living resources. Chapman and Hall, London.
Hedges, S. B., J. E. Blair, M. L. Venturi, and J. L. Shoe. 2004. A molecular timescale of eukaryote evolution and the rise of complex multicellular life. BMC Evol. Biol. 4:2.
Hennig, W. 1981. Insect phylogeny. Academic Press, New York.
Howland, D. E., and G. M. Hewitt. 1995. Phylogeny of the Coleoptera based on mitochondrial cytochrome oxidase I sequence data. Insect Mol. Biol. 4:203–215.
Hsiao, L. L., F. Dangond, T. Yoshida et al. (22 co-authors). 2001. A compendium of gene expression in normal human tissues. Physiol. Genomics 7:97–104.
Huelsenbeck, J. P., and F. Ronquist. 2001. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 17:754–755.
Hughes, J., and A. P. Vogler. 2004. The phylogeny of acorn weevils (genus Curculio) from mitochondrial and nuclear DNA sequences: the problem of incomplete data. Mol. Phylogenet. Evol. 32:601–615.
Korte, A., I. Ribera, R. G. Beutel, and D. Bernhard. 2004. Interrelationships of Staphyliniform groups inferred from 18S and 28S rDNA sequences, with special emphasis on Hydrophiloidea (Coleoptera, Staphyliniformia). J. Zool. Syst. Evol. Res. 42:281–288.
Kukalova-Peck, J., and J. F. Lawrence. 1993. Evolution of the hind wing in Coleoptera. Can. Entomol. 125:181–258.
Landais, I., M. Ogliastro, K. Mita, J. Nohata, M. Lopez-Ferber, M. Duonor-Cerutti, T. Shimada, P. Fournier, and G. Devauchelle. 2003. Annotation pattern of ESTs from Spodoptera frugiperda Sf9 cells and analysis of the ribosomal protein genes reveal insect-specific features and unexpectedly low codon usage bias. Bioinformatics 19:2343–2350.
Lawrence, J. F., and A. F. Newton Jr. 1995. Families and subfamilies of Coleoptera (with selected genera, notes, references and data on family-group names). Pp. 779–913 in J. Pakaluk and S. A. Slipinski, eds. Biology, phylogeny and classification of Coleoptera. Papers celebrating the 80th birthday of Roy A. Crowson. Muzeum I Instytut Zoologii PAN, Warsaw, Poland.
Li, L., B. P. Brunk, J. C. Kissinger et al. (20 co-authors). 2003. Gene discovery in the apicomplexa as revealed by EST sequencing and assembly of a comparative gene database. Genome Res. 13:443–454.
Mita, K., M. Morimyo, K. Okano et al. (12 co-authors). 2003. The construction of an EST database for Bombyx mori and its application. Proc. Natl. Acad. Sci. USA 100:14121–14126.
Murphy, W. J., E. Eizirik, S. J. O'Brien et al. (11 co-authors). 2001. Resolution of the early placental mammal radiation using Bayesian phylogenetics. Science 294:2348–2351.
Olmstead, R. G., and J. A. Sweere. 1994. Combining data in phylogenetic systematics—an empirical approach using 3 molecular data sets in the Solanaceae. Syst. Biol. 43:467–481.
Parkinson, J., A. Anthony, J. Wasmuth, R. Schmid, A. Hedley, and M. Blaxter. 2004a. PartiGene—constructing partial genomes. Bioinformatics 20:1398–1404.
Parkinson, J., and M. Blaxter. 2003. SimiTri—visualizing similarity relationships for groups of sequences. Bioinformatics 19:390–395.
Parkinson, J., and M. Blaxter. 2004. Expressed sequence tags: analysis and annotation. Methods Mol. Biol. 270:93–126.
Parkinson, J., D. B. Guiliano, and M. Blaxter. 2002. Making sense of EST sequences by CLOBBing them. BMC Bioinformatics 3:31.
Parkinson, J., M. Mitreva, C. Whitton et al. (12 co-authors). 2004b. A transcriptomic analysis of the phylum Nematoda. Nat. Genet. 36:1259–1267.
Philip, G. K., C. J. Creevey, and J. O. McInerney. 2005. The Opisthokonta and the Ecdysozoa may not be clades: stronger support for the grouping of plant and animal than for animal and fungi and stronger support for the Coelomata than Ecdysozoa. Mol. Biol. Evol. 22:1175–1184.
Philippe, H., N. Lartillot, and H. Brinkmann. 2005. Multigene analyses of bilaterian animals corroborate the monophyly of Ecdysozoa, Lophotrochozoa, and Protostomia. Mol. Biol. Evol. 22:1246–1253.
Philippe, H., E. A. Snell, E. Bapteste, P. Lopez, P. W. Holland, and D. Casane. 2004. Phylogenomics of eukaryotes: impact of missing data on large alignments. Mol. Biol. Evol. 21:1740–1752.
Pons, J., T. Barraclough, K. Theodorides, A. Cardoso, and A. Vogler. 2004. Using exon and intron sequences of the gene Mp20 to resolve basal relationships in Cicindela (Coleoptera: Cicindelidae). Syst. Biol. 53:554–570.
Ragan, M. A. 1992. Matrix representation in reconstructing phylogenetic relationships among the eukaryotes. Biosystems 28:47–55.
Rodriguez-Ezpeleta, N., H. Brinkmann, S. C. Burey, B. Roure, G. Burger, W. Loffelhardt, H. J. Bohnert, H. Philippe, and B. F. Lang. 2005. Monophyly of primary photosynthetic eukaryotes: green plants, red algae, and glaucophytes. Curr. Biol. 15:1325–1330.
Rubin, G. M., L. Hong, P. Brokstein, M. Evans-Holm, E. Frise, M. Stapleton, and D. A. Harvey. 2000. A Drosophila complementary DNA resource. Science 287:2222–2224.
Rudd, S. 2003. Expressed sequence tags: alternative or complement to whole genome sequences? Trends Plant Sci. 8:321–329.
Sorenson, M. D. 1999. TREEROT. Version 2c. Boston University, Boston.
Teeling, E. C., M. S. Springer, O. Madsen, P. Bates, S. J. O'Brien, and W. J. Murphy. 2005. A molecular phylogeny for bats illuminates biogeography and the fossil record. Science 307:580–584.
Theodorides, K., A. De Riva, J. Gomez-Zurita, P. G. Foster, and A. P. Vogler. 2002. Comparison of EST libraries from seven beetle species: towards a framework for phylogenomics of the Coleoptera. Insect Mol. Biol. 11:467–475.
Thompson, J. D., T. J. Gibson, F. Plewniak, F. Jeanmougin, and D. G. Higgins. 1997. The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. 25:4876–4882.
Wheeler, W. C., M. Whiting, Q. D. Wheeler, and J. M. Carpenter. 2001. The phylogeny of the extant hexapod orders. Cladistics 17:113–169.
Wiens, J. J. 2003. Missing data, incomplete taxa, and phylogenetic accuracy. Syst. Biol. 52:528–538.
Zdobnov, E. M., C. von Mering, I. Letunic et al. (36 co-authors). 2002. Comparative genome and proteome analysis of Anopheles gambiae and Drosophila melanogaster. Science 298:149–159.

Author notes

*Department of Entomology, The Natural History Museum, London, United Kingdom; †Department of Biological Sciences, Imperial College London, Silwood Park Campus, Ascot, United Kingdom; and ‡Department of Zoology, The Natural History Museum, London, United Kingdom

Supplementary data