We used an expressed sequence tag and 454 pyrosequencing approach to initiate a study of the genome of the screwworm, Cochliomyia hominivorax (Coquerel) (Diptera: Calliphoridae). Two normalized cDNA libraries were constructed from RNA isolated from embryos and second instar larvae from the Panama 95 strain. Approximately 5,400 clones from each library were sequenced from both the 5′ and 3′ directions using the Sanger method. In addition, double-stranded cDNA was prepared from random-primed polyA RNA purified from embryos, second-instar larvae, adult males, and adult females. These four cDNA samples were used for 454 pyrosequencing that produced ≈300,000 independent sequences. Sequences were assembled into a database of assembled contigs and singletons and used to search public protein databases and annotate the sequences. The full database consists of 6,076 contigs and 58,221 singletons assembled from both the traditional expressed sequence tag (EST) and 454 sequences. Annotation of the data led to the identification of several gene coding regions with possible roles in sex determination in the screwworm. This database will facilitate the design of microarray and other experiments to study screwworm gene expression on a larger scale than previously possible.
The screwworm, Cochliomyia hominivorax (Coquerel) (Diptera: Calliphoridae), is a severe pest of livestock and other warm-blooded animals, including humans, in parts of Central and South America. The female fly deposits its eggs at a wound site and, upon hatching, the larvae feed on the host’s living tissue. During seasons of heavy screwworm fly infestation, death of the host animal will often ensue if protective measures are not taken. The screwworm has been eradicated from North America and reinfestation is prevented by maintenance of a biological barrier zone formed by the ongoing mass release of sterilized male flies in Panama (Vargas-Teran 1991; http://www.nal.usda.gov/speccoll/collect/screwworm/Start.htm). Before the eradication of the screwworm from the United States, annual losses to the U.S. livestock industry topped $120 million in the 1950s (Graham and Hourrigan 1977). The benefits of the eradication program to the United States were estimated to be >$1 billion in 1974 (Goodwin 1974). Annual costs of maintaining the USDA-Regional Screwworm Program are approximately $31 million, and several avenues of increasing program efficiency are being explored. One of these areas of research involves the potential use of transgenic female-conditional lethal strains of screwworm. The development of this type of fly strain has been successfully demonstrated in Drosophila melanogaster (Meigen) (Thomas et al. 2000) and Ceratitus capitata (Wiedemann) (Fu et al. 2007), and their use in control programs is being actively explored. However, these two fly species have a wealth of molecular and genetic information in public databases, whereas molecular and gene sequence data for screwworm is lacking, with only 29 sequence records available in GenBank and no expressed sequence tag (EST) entries (queried 28 October 2008).
Genome sequencing and focused EST acquisition projects have greatly advanced the study of a wide range of biological organisms and can provide new opportunities to solve agricultural problems and guide searches for novel control technologies. The size of the screwworm genome has not been reported and the likelihood of obtaining whole genome sequence in the near future is low. EST-based projects will accelerate the development of new approaches to study the genome and molecular biology of this fly. As a component of a project directed toward development of female-conditional lethal strains of transgenic screwworm, we have developed a database of genes expressed in the screwworm. We used a traditional cloned cDNA library EST-based approach to identify genes expressed in the embryonic and larval stages of the screwworm life cycle. Subsequently, a 454 pyrosequencing approach was used to obtain sequences from expressed transcripts in screwworm embryos, larvae, adult males and adult females. Bioinformatics tools were used to assemble all the sequences into a single expressed transcript database and analyze the data for sequences related to sex determination genes of other organisms which might prove useful for developing the female-conditional lethal strains.
Materials and Methods
All egg and larval materials were from the Panama 95 screwworm strain reared at the Pacora screwworm production plant in Panama according to established protocols (Chaudhury and Skoda 2007). Adult flies were obtained from the Panama 95 screwworm strain reared similarly at the screwworm production plant in Tuxtla Gutierrez, Chiapas, Mexico. One gram of eggs, second-instar larvae, and 10 g of adults of mixed sex were collected and immediately frozen on dry ice and maintained at –80°C. Adult flies were separated by sex on dry ice and kept frozen until pulverized during the nucleic acid isolation procedure.
Nucleic Acid Manipulations.
Two grams each of adult males and adult females, 0.5 g of eggs, and 1 g of second-instar larvae were used for isolation of total and poly(A)+ RNA. Ten milliliters of RNAlater-ICE (Ambion, Austin, TX) was added to each sample according to manufacturer’s instructions. Total RNA was isolated using the Totally RNA kit reagents with a lithium chloride precipitation step (Ambion). RNA integrity was verified by formaldehyde gel electrophoresis and staining in GelStar Nucleic Acid Gel Stain (Lonza Rockland, Inc., Rockland, ME). Total RNA was used to construct a standard normalized cDNA library under contract by Express Genomics, Inc. (Frederick, MD), by using proprietary protocols. In brief, cDNA synthesis was primed using an oligo-dTVN primer and double-stranded cDNA directionally cloned into the EcoRV-NotI site of pExpress1 after size selection on an agarose gel. cDNA normalization was carried out using biotinylated RNA driver transcribed from the cloned cDNA and hybridization with single-stranded cDNA circles generated by phagemid production. After hybrid removal using streptavidin, single-stranded circles were converted into double-stranded DNA and then electroporated into T1 phage resistant DH10b Escherichia coli cells.
Sequencing of ESTs from the two normalized libraries was performed at the J. Craig Venter Institute (Rockville, MD). Bacterial colonies were picked for template preparation using colony-picking robots (Genetix, Boston, MA), inoculated into 384-well plates containing liquid medium and grown overnight. A robotic work station was used to prepare sequencing grade plasmid DNA by using the alkaline lysis method (Sambrook et al. 1989) modified for high-throughput processing. Beckman Multimek 96 or Biomek FX automated pipetting robot work stations (Beckman Coulter, Fullerton, CA) were used to combine prealiquoted templates and reaction mixes consisting of deoxy- and fluorescently labeled dideoxynucleotides, TaqDNA polymerase, sequencing primers, and sequencing reaction buffer. Linear amplification steps were performed on Tetrads PTC-225 (MJ Research, Watertown, MA), and sequencing reaction products were purified by ethanol precipitation and resolved on ABI 3730xl sequencing machines (Applied Biosystems, Foster City, CA). The unassembled EST sequences were submitted to GenBank dbEST with the accession numbers FG282693-FG291953 (eggs) and FG291954–FG301340 (larvae).
PolyA RNA was isolated from two 400-μg aliquots of total RNA from egg, second-instar larvae, adult males and adult females using the MicroPoly(A) Purist kit (Ambion). Five micrograms of polyA RNA was used to synthesize blunt-ended double-stranded cDNA by using the Just cDNA Double-Stranded cDNA Synthesis kit (Stratagene, La Jolla, CA) and the supplied random primers. DNA sample preparation for sequencing on the 454/Roche GS-FLX was performed as described by the manufacturer (Margulies et al. 2005). In brief, this entailed shearing the DNA via nebulization, end repairing as described by Roe (2004), followed by ligation of adapter sequences and a second round of end repair to yield a blunt-ended DNA library that was quantified and diluted before amplification via emulsion polymerase chain reaction (emPCR) (Margulies et al. 2005). After emPCR enrichment, the DNAs were loaded onto a 454/Roche GE-FLX for massively parallel pyrosequencing. The resulting sequence data in Standard Flowgram Format (sff) were assembled using a beta version of SeqMan NGen (DNAstar, Madison, WI) at 97% homology, match size 15, match spacing 24, and minimum sequence length of 100 bp. Assemblies were evaluated for false joins visually using SeqManPro (DNAstar). The raw 454 sequence data were submitted to GenBank’s Short Trace Archive (accession no. SRA001820).
Bioinformatic Analysis of ESTs.
In the database of assembled ESTs, sequences were mapped to functional classifications schemes such as gene ontology (GO) terms (Ashburner et al. 2000), Kyoto Encyclopedia of Genes and Genomes pathways (Ogata et al. 1999), Swiss Prot Protein Keywords (Bairoch et al. 2005), and the Database for Annotation, Visualization, and Integrated Discovery (Dennis et al. 2003). For the GO term comparisons with the C. hominivorax developmental stage data sets, the Flybase D. melanogaster 2008_09 Release GO term annotations was used in conjunction with Fisher exact test for determination of GO term distribution differences.
Results and Discussion
To facilitate gene discovery and gene expression studies in embryonic and early larval stages of the screwworm, we constructed two normalized cDNA libraries from total RNA isolated from eggs and second-instar larvae. The initial nonnormalized cDNA libraries consisted of a total of 2.16 × 106 and 3.00 × 106 colony-forming units (cfu) for the egg and larval libraries, respectively. After normalization, a total of 9.6 ×107 cfu, 87% recombinants with average insert size of 1.8 kb, were recovered from the egg library. A total of 8.40 × 107 cfu, 87% recombinants with average insert size of 1.7 kb, were recovered from the larval library. Normalization resulted in a 40- and 50-fold reduction in clones from the egg and larval libraries, respectively, which hybridized to an actin probe when compared with the number hybridizing from the primary libraries (quality control tests performed by Express Genomics Inc.; data not shown). The pUC/M13Reverse and pUC/M13Forward primers flank the pExpress1 multiple cloning site so as to generate 5′ and 3′ end sequence data from the cDNA clone’s insert, respectively. We sequenced 14 384-well plates from each library in both the 5′ and 3′ directions, obtaining 9,261 and 9,387 sequences from the egg and larval libraries, respectively (Table 1). Sequencing of the egg cDNA library resulted in 4,212 and 5,049 successful forward and reverse direction sequence reactions, respectively. Sequencing of the larval cDNA library resulted in 4,503 and 4,884 successful forward and reverse direction sequence reactions, respectively. The sequencing success rate, mean read length, and % G + C values were similar in both libraries, although all three parameters were lower in both forward (3′ end) direction data sets than the corresponding reverse direction data sets (Table 1). The lower sequencing reaction success rates and mean read lengths for the forward reactions is probably due to problems presenting from polyA tracts at or near the 3′ end of the ESTs. In addition, 3′ ends of gene coding regions tend to be more A-T rich, thus lowering the % G + C for the forward data sets.
The 454 sequencing approach on the random-primed egg, larval, adult male, and adult female cDNAs yielded a total of 46,695, 64,686, 62,853, and 49,650 sequences, respectively, that assembled into 3,091, 2,255, 3,801, and 2,432 contigs, respectively. Comparing the data from the egg and larval ESTs, which were obtained by traditional Sanger sequencing, with the data obtained from the egg and larval 454 sequencing contigs, there is an obvious difference in mean length (Table 1). The average length of the forward and reverse direction EST sequencing reads from the egg and larval normalized libraries is ≈970 bp, whereas the mean length of the assembled contigs from the egg and larval 454 data are ≈350 bp. The average length from the individual 454 sequencing reads was ≈230 bp (data not shown). However, the % G + C from the two sequencing methodologies was essentially the same. The average % G + C when combining the forward and reverse direction reads from the egg ESTs was 33.8%, whereas the 454 contig data had 32% G + C. The average % G + C from the forward and reverse direction reads of the larval ESTs was 35.6%, whereas the 454 contig data had 36% G + C.
Finally, the 18,648 EST and 223,884 individual 454 sequences also were assembled into a combined screwworm database of 6,076 assembled contigs and 58,221 singletons (Table 1 and Supplemental Data) and used to search public protein databases and annotate the sequences. The mean length of the contigs and singletons in the combined data set were 1,194 and 237 bp, respectively. The % G+C of the contigs and singletons was 34 and 30%, respectively. There were 4,059 and 3,423 BlastX hits to the 15 February 2008 UniProt database (The UniProt Consortium 2008) using cut-off e-values of 0.001 and 1 × e-25, respectively. See Supplemental Data for the annotations accompanying the assembled contigs with UniProt hits. The latest release of the annotated D. melanogaster genome indicates 15,140 genes located to the genome sequence (release 5.12; http://flybase.org/static_pages/docs/release_notes.html). Assuming a similar number of genes exist in screwworm, it is possible the combined database of 6,076 contigs represents up to 40% of the protein coding genes of the screwworm. Of course, it is possible that some of the screwworm contigs represent different transcripts from the same gene coding region or that some contigs are incorrectly maintained as separate entries in the database when they rightfully should be assembled into a single contig. Database inaccuracies arising from these problems are correctable by manual curation and will be brought to light as the database is used. Of the 4,059 contigs with BlastX hits to UniProt entries (e-value <0.001), 1,792 are to D. melanogaster (see Supplemental Data), not surprisingly the species with the most UniProt BlastX hits to the combined screwworm database contigs. The next two most frequently matched organisms were mouse and human with 472 and 461 BlastX hits to UniProt, respectively. Because C. hominivorax and D. melanogaster are both dipterans, we performed a BlastX analysis of the egg, larval, adult male, and adult female databases, by using the latest annotation of the D. melanogaster genome as the reference database. For this analysis, we assembled the combined 454- and Sanger-derived data from the individual life stages to produce combined egg, combined larval, combined adult male, and combined adult female data sets. The annotations and contigs are presented as Supplemental Data.
As an overall characterization of the combined data set, we determined the frequency of occurrence of the GO terms. In the combined data set, 2,831, 3,272, and 3,546 contigs had GO terms from the cellular component, biological process, and molecular function ontologies, respectively. Table 2 lists the top 20 most commonly found GO terms in each of the three ontologies, whereas a list of the top 100 GO terms is found in the Supplemental Data. It is striking that eight of the top 20 cellular component GO terms relate to organelle, presumably the mitochondrion complement of genes as mitochondrion is noted in 23% of the genes with cellular component GO terms. Five GO terms that relate to membranes are in the cellular component top 20. The biological process GO terms top 20 list has eight terms that relate to metabolism, probably reflecting that larval-expressed genes are a significant part of the data set. Larval stages of the screwworm are voracious feeders and metabolism of the meal for energy generation and growth processes is expected to be a significant activity of that life stage. Half of the top 20 molecular function GO terms are related to binding, including protein, nucleic acid, nucleotide, purine nucleotide, ion, metal ion, cation, and adenyl nucleotide and ATP binding. Seventy percent of the combined data set genes with molecular function GO terms contained the term "binding."
Because a significant portion of the data set is derived from the egg and larval cDNA libraries that were normalized to increase the relative abundance of rare transcripts, the distribution of GO terms might be different from the distribution which would be found in a data set containing the entire set of expressed genes. To examine how the screwworm’s combined data set GO terms might differ from that of two dipterans, D. melanogaster and Aedes aegypti (L.) and one hymenopteran, Apis mellifera L., we determined the top 10 GO terms from each of the three ontologies and noted where each of these GO terms ranked in the screwworm combined data set (Table 3). The GO terms for D. melanogaster release 11.0 (14 June 2006), A. aegypti release 5.0 (25 April 2008), and A. mellifera release 5.0 (3 May 2008) were obtained from The Gene Index Project (http://compbio.dfci.harvard.edu/tgi/cgi-bin/tgi/). The identity and order of the top 10 Cellular Component GO Terms of A. aegypti and A. mellifera are identical, whereas the identity of the top 10 biological process and molecular function GO terms are identical in the two species, although some terms have slightly different rankings in the top 10 list. Surprisingly, the top 10 GO term list of D. melanogaster has several differences compared with Ae. aegypti and A. mellifera. Protein complex, cellular component unknown, extracellular matrix, and obsolete cellular component are D. melanogaster cellular component GO terms that are not in the top 10 list of the other two insects. Physiological process, regulation of biological process, biological process unknown, interaction between organisms, and pigmentation are D. melanogaster biological process GO terms that are not in the top 10 list of the other two insects. Molecular function unknown, signal transducer activity, and obsolete molecular function are D. melanogaster molecular function GO terms that are not in the top 10 list of the other two insects. Table 3 shows where each of the top 10 GO terms from D. melanogaster, Ae. aegypti, and A. mellifera would rank in the listing of the GO terms from the screwworm combined data set annotations. Many of the most common GO terms of Table 3 have greatly different rankings in the screwworm data set. For example, the biological process GO term reproduction ranks in the top seven for all three species of Table 3, whereas it ranks 56 in the screwworm data set. The molecular function GO term transcriptional regulator activity ranks five in all three species of Table 3, yet ranks 35 in the screwworm data set. Other examples can be seen in Table 3. In fact, only the top two or three ranking GO Terms from Table 3 have similar rankings in the screwworm data set. These major differences probably reflect that two of the prime sources of the screwworm data set are assembled contigs and unassembled singletons derived from two normalized cDNA libraries prepared from egg and larval life stages. These egg and larval contigs probably served as seed sequences to the collection of 454-derived sequences, allowing more contigs to be assembled from egg and larval 454 sequences, skewing the GO terms and rankings toward ontological classifications from egg- and larval-expressed transcripts. The assembled contigs from The Gene Index Project that were used to obtain the D. melanogaster, Ae. aegypti, and A. mellifera GO terms were almost all derived from adult cDNAs.
We also used the GO annotations to estimate how comprehensive the screwworm combined data set might be as compared with the well-annotated D. melanogaster genome. The GO consortium shows a total of 15,601, 8,424, and 2,230 biological process, molecular function, and cellular component GO Terms as of 30 October 2008 (http://www.geneontology.org/GO.downloads.ontology.shtml). The FlyBase 2008_09 D. melanogaster release was used to determine that the most recent annotation of the D. melanogaster genome makes use of 2,401, 1,759, and 532 of the total biological process, molecular function, and cellular component GO terms, respectively. Assuming the screwworm and D. melanogaster have a similar complement of genes and gene number, comparing the GO term use of our screwworm database to that of the D. melanogaster genome annotations could give an indication of what percentage of the screwworm gene count is present in our combined database. Our database annotations of the 6,076 contigs uses 1,658, 1,054, and 453 of the total biological process, molecular function, and cellular component GO terms, respectively, yielding respective percentages of the total GO terms used as 69, 60, and 85%. Although we would not contend the screwworm combined data set contains these percentages of the total gene count expected from the genome, these data indicate that our data set contains a significant fraction of the expressed genes of the screwworm. As noted, direct GO term comparisons of the D. melanogaster and screwworm combined data set are likely affected by the significant number of transcripts in the screwworm data set that are derived from embryos and larvae.
To evaluate how the different life stages of C. hominivorax varied from each other and from D. melanogaster, we used the GO annotations associated with the four developmental stage sequence data sets and D. melanogaster. The number of times each GO term occurred in the D. melanogaster, combined egg, combined larval, combined adult male, and combined adult female data sets was tallied. The tallies were compared between D. melanogaster and each of the combined egg, combined larval, combined adult male, and combined adult female data sets to identify GO terms that were more abundant in each C. hominivorax data set compared with D. melanogaster. The entire analytical results, including P values from Fisher exact test, are presented in Supplemental Data. Figure 1 depicts the 20 most overrepresented molecular function GO terms from the combined adult female data set and noting their occurrence in the other three C. hominivorax data sets. The results from the four data sets are very consistent for 15- of the 20-most overrepresented molecular function GO terms, indicating the four life stages vary from D. melanogaster in a similar manner for these molecular functions. However, catalytic activity, cation binding, transmembrane transporter activity, and molecular transducer activity were not overrepresented in the combined egg data set. In addition, catalytic activity and transferase activity were not overrepresented in the combined larval data set. Thus, from this type of GO term analysis, the egg stage of C. hominivorax seems to differ least from D. melanogaster.
A major objective of this DNA sequence acquisition project was to identify screwworm orthologs to insect sex determination pathway genes. By a search of the combined data set annotations, we identified several sequences that had similarity to UniProt entries that are involved in arthropod sex determination pathways. Some of the best characterized sex determination genes from D. melanogaster are transformer, doublesex, and sex-lethal (Saccone et al. 2002), and we searched the gene names list from the combined data set contigs (see Supplemental Data) for these specific names. Of particular interest is contig 17682 identified as sex-lethal and possessing high similarity to the sex-lethal ortholog in Chrysomya rufifacies (Macquart), a member of the blowfly family (Müller-Holtkamp 1995). A directed search of GenBank also found a sex-lethal ortholog from Lucilia cuprina (Wiedemann) and a ClustalX alignment (Thompson et al. 1997) of the translated screwworm sequence and the L. cuprina and C. rufifacies sequences demonstrated the region of high amino acid identity extends over most of the protein sequence (Fig. 2). Because contig 17682 is from an assembly of ESTs that originated from single-pass sequencing, it is possible there are errors in the contig’s sequence that can affect the reading frame of the sequence. In deriving the alignment for Fig. 1, we manually adjusted the sequence of contig 17682 to optimize the alignment with the other two blowfly sequences. This was done by reverse complementing and translating contig 17682 in frame two followed by deleting the resulting nucleotides 968 (T), 1003 (A), and 1017 (T).
Using the GO term annotations (see Supplemental Data), we identified entries in the screwworm combined data set that had biological process GO term classifications containing male sex determination (GO:0030238), somatic sex determination (GO:0018993), or female sex differentiation (GO:0046660), and these are listed in Table 4. Contig 59623 has moderate similarity (e value = 1.78e–6) to a testis-determining gene responsible for male development in mice sry (Gubbay et al. 1990; GenBank accession no. Q05738). Contig 9469 has high similarity (e value = 3.21e–63) to a D. melanogaster protein Achaete-Scute complex protein T4 that interacts with the D. melanogaster sex-determination protein sex-lethal (Torres and Sanchez 1989; GenBank accession no. P10084). Contigs 11235, 7414, 9796, and 6976 have high similarity to hyperplastics discs, a gene involved in germ cell development (Mansfield et al. 1994). Unfortunately, sequences with similarity to doublesex or transformer were not found in our database and isolation of these orthologs will require either deeper sequencing of our cDNAs or use of PCR in conjunction with degenerate primers designed from orthologs from other species.
We thank Steve Skoda and Pamela Phillips for collection of screwworm material, Susan Tweedie for help with our questions on FlyBase data, David Hill for help from the GO Consortium’s help desk, and the sequencing core personnel at the J. Craig Venter Institute Joint Technology Center.