Tal Naggan Perl, Bernhard G.M. Schmid, Jonas Schwirz, Ariel D. Chipman, The Evolution of the knirps Family of Transcription Factors in Arthropods, Molecular Biology and Evolution, Volume 30, Issue 6, June 2013, Pages 1348–1357, https://doi.org/10.1093/molbev/mst046
Abstract
The orphan nuclear receptor gene knirps and its relatives encode a small family of highly conserved proteins. We take advantage of the conservation of the family, using the recent prevalence of genomic data, to reconstruct its evolutionary history, identifying duplication events and tracing the intron–exon structure of the genes over evolution. Many arthropod species have two or three members of this family, but the orthology between members is unclear. We have analyzed the protein coding sequences of members of this family from 15 arthropod species covering all four main arthropod classes, including a total of 28 genes. All members of the family encode a highly conserved 94 amino acid core sequence, part of which is encoded by a single invariant exon. We find that many of the automated predictions of these genes contain errors, while some copies of the gene were not uncovered by automated pipelines, requiring manual corrections and curation. We use the coding sequences to present a phylogenetic analysis of the knirps family. Our analysis indicates that there was a duplication of a single ancestral gene in the lineage leading to insects, which gave rise to two paralogs, eagle and knirps-related. Descendants of this duplication can be identified by the presence or absence of a short protein-coding motif. Independent, lineage-specific duplications occurred in the two crustaceans we sampled. Within the insects, the knirps-related gene underwent further lineage-specific duplications, giving rise to—among others—the Drosophila gap gene knirps.
Introduction
The knirps gene and its relatives (also known as NR0A family) encode an orphan nuclear hormone receptor with a C4 zinc finger motif, which is distinguished by the lack of a ligand-binding domain. The encoded protein has a highly conserved core of 94 amino acids shared by all homologs of all organisms analyzed (fig. 1). This core contains the DNA-binding domain (Gerwin et al. 1994) and is not encoded by any other genes outside of the knirps family. Many arthropod species have two or three members of this family, whereas others have only one, but the orthology and functional division between members is unclear.

Fig. 1. An alignment of the 94 amino acid residues that make up the core sequence of all 28 genes of the Knirps family that were used in our analysis. The bounding box indicates the conserved first exon. Isca, Ixodes scapularis; Smar, Strigamia maritima; Dpul, Daphnia pulex; Phum, Pediculus humanus; Apis, Acyrthosiphon pisum; Amel, Apis mellifera; Nvit, Nasonia vitripennis; Bmor, Bombyx mori; Tcas, Tribolium castaneum; Agam, Anopheles gambiae; Cqui, Culex quinquefasciatus; Dmel, Drosophila melanogaster; Rpro, Rhodnius prolixus; Phaw, Parhyale hawaiensis.
The role and structure of different copies of the gene have only been studied in Drosophila melanogaster where knirps was first described. The Drosophila gap gene knirps plays a central role in abdominal development (Nauber et al. 1988) and interacts with the other trunk gap genes as part of the gap gene network. Drosophila has a total of three different paralogs of the gene: the gene pair knirps (kni) and knirps-related (knrl), which are located close to each other on the left arm of the third chromosome (Rothe et al. 1989), and eagle, also known as egon (eg), which also resides on the left arm of the third chromosome but in a distinct chromosomal region (Rothe et al. 1989). The first two share most of their expression patterns and are redundant in function during the development of different tissues such as embryonic head, gut, tracheal system, and adult wing (Gonzalez-Gaitan et al. 1994; Chen et al. 1998; Lunde et al. 2003; Fuß et al. 2001). The third of these, eg, is expressed in the embryonic gonads in Drosophila (Rothe et al. 1989) and has two transcripts that vary only in their UTRs but code for the same protein product. More intriguing is the discovery that the length of the introns in the genes plays an important role in the difference between the paralogs and their functions. One of the major differences between kni and knrl in D. melanogaster is that knrl has a transcription unit which is much longer and contains a large intron that makes the transcription unit 7.5 times longer than that of kni. This intron allows the gene to only be partially transcribed in the early stages of development, because of its long transcription time relative to the rapid nuclear division cycles during early syncytial blastoderm stages (Rothe et al. 1992). An additional difference between the genes of this gene pair is in the regulation of the blastodermal expression domain in the prospective abdomen (Rothe et al. 1994).
A potential role for knirps homologs in the gap gene network is also known from other dipterans, the mosquito Anopheles gambiae (Goltsev et al. 2004) and the moth midge Clogmia albipunctata (García-Solache et al. 2010). In the red flour beetle Tribolium castaneum, knirps (Tc-kni) has been shown to be required for head segmentation (Cerny et al. 2008), while the other homolog (Tc-eg) is expressed maternally and its mRNA is localized to the anterior pole of the egg (Bucher et al. 2005). In the milkweed bug Oncopeltus fasciatus, one of the family members plays a minor role in the gap gene network (Ben-David and Chipman 2010) and has no obvious role in head development (unpublished data). Little is known about the role of knirps family genes in other arthropods.
The knirps family of genes makes an interesting case study for molecular evolution, as it is fairly small, usually no more than two to three paralogs per species, with easily identifiable members (Bonneton and Laudet 2012). The evolution of the intron–exon structure is also interesting because of the functional importance of intron length in D. melanogaster. In this contribution we have adopted a bioinformatics approach, making use of the proliferation of fully sequenced arthropod genomes. We show that it is possible to reconstruct the evolutionary history of this gene family, taking advantage of its conservation, focusing on identifying individual duplication events and tracing the intron–exon structure of the gene over evolution. We also point out several pitfalls that are inherent in this type of work and demonstrate the limitations of automatic annotations of newly sequenced genomes.
Results and Discussion
Sequences Obtained
Database searches combined with manual gene structure analysis recovered a total of 28 distinct genes in 15 different organisms. We assume that these sequences represent the full complement of the family in all of these species, although we are aware that additional paralogs might be discovered in the course of ongoing genome assemblies.
Table 1 summarizes the analyzed organisms and the number of copies of knirps paralogs found in each of them. Most of the organisms are insects, as this class is best represented in sequenced genomes, but we have included representatives of the chelicerates (Ixodes scapularis), the myriapods (Strigamia maritima), and the crustaceans (Daphnia pulex and Parhyale hawaiensis). The database in which each of the genomes was searched is listed in the source column. Specific accession numbers or genome locations are listed in table 2.
Organism | Common name | Class/Order | No. of Gene Copies | Source |
---|---|---|---|---|
Acyrthosiphon pisum | Pea aphid | Insecta/Paraneoptera | 3ᵃ | NCBI genome blast. Build 2.1 |
Aedes aegypti | Yellow fever mosquito | Insecta/Diptera | 1 | http://aaegypti.vectorbase.org/ |
Anopheles gambiae | African malaria mosquito | Insecta/Diptera | 2 | http://agambiae.vectorbase.org/ |
Apis mellifera | Honey bee | Insecta/Hymenoptera | 3 | NCBI genome blast. Amel 4.5 |
Bombyx mori | Silkworm | Insecta/Lepidoptera | 1 | http://www.silkdb.org/ |
Culex quinquefasciatus | Southern house mosquito | Insecta/Diptera | 2 | http://cquinquefasciatus.vectorbase.org/ |
Daphnia pulex | Water flea | Crustacea/Branchiopoda | 2 | http://wfleabase.org/ |
Drosophila melanogaster | Fruit fly | Insecta/Diptera | 3 | NCBI genome blast. Release 5.30 |
Ixodes scapularis | Black tick | Chelicerata/Acari | 1 | http://iscapularis.vectorbase.org/ |
Nasonia vitripennis | Jewel wasp | Insecta/Hymenoptera | 1 | NCBI genome blast build 2.1 |
Parhyale hawaiensis | Beach hopper | Crustacea/Malacostraca | 2 | Our cloning |
Pediculus humanus | Human louse | Insecta/Paraneoptera | 3 | http://phumanus.vectorbase.org/ |
Rhodnius prolixus | Triatomid bug | Insecta/Paraneoptera | 2 | http://rprolixus.vectorbase.org/ |
Strigamia maritima | Coastal centipede | Myriapoda/Chilopoda | 1 | Strigamia genome project |
Tribolium castaneum | Red flour beetle | Insecta/Coleoptera | 2 | NCBI genome blast build 2.1 |
ᵃ Two genes are nearly identical at the DNA sequence level, so only one was used in the analysis.
Table 2. List of Genes with Transcript and Protein References, and Notes Indicating Whether We Predicted the Protein Manually or Corrected the Existing Prediction.
Gene | Genomic Location or Transcript Code | Predicted Protein | Notes |
---|---|---|---|
Acyrthosiphon pisum 1 | XM_003243072 | XP_003243120 | Corrected predicted sequence |
Acyrthosiphon pisum 2 | XM_003248946 | XP_003248994 | Highly similar to Acy. pisum 3. Not used in further analysis |
Acyrthosiphon pisum 3 | XM_003244284 | XP_003244332 | |
Aedes aegypti | NW_001811303.1 | XP_001661506.1 | |
Anopheles gambiae 1 | XM_001230803.1 | AGAP010438 | Corrected predicted sequence |
Anopheles gambiae 2 | Chr. 3L 3413051 … 3429794 | No protein predicted | Our prediction |
Apis mellifera 1 | XM_001120531 | XP_001120531 | |
Apis mellifera 2 | XM_395932 | XP_395932 | |
Apis mellifera 3 | XM_001120662 | XP_001120662 | |
Bombyx mori | scaffold 316:325125 … 378308 | No protein predicted | Our prediction |
Culex quinquefasciatus 1 | supercontig 3.87:256000 … 304600 | CPIJ004741 | Corrected predicted sequence |
Culex quinquefasciatus 2 | supercontig 3.517:186600 … 199900 | CPIJ013614 | Corrected predicted sequence |
Daphnia pulex 1 | scaffold 43: 250370 … 252140 | JGI_V11_290673 | |
Daphnia pulex 2 | scaffold 43: 295311 … 300650 | JGI_V11_290668 | Corrected predicted sequence |
Drosophila melanogaster kni | NM_079463.2 | NP_524187.1 | |
Drosophila melanogaster knrl | NM_176374.1 | NP_788552.1 | |
Drosophila melanogaster eg | NM_079482.3, NM_168938.1 | NP_524206.1, NP_730689.1 | |
Ixodes scapularis | NW_002720141.1 | ISCW008631 | Corrected predicted sequence |
Nasonia vitripennis | XM_001604919.2 | XP_001604969.2 | |
Pediculus humanus 1 | XM_002430215.1 | PHUM468660 | |
Pediculus humanus 2 | XM_002430218.1 | PHUM470190 | |
Pediculus humanus 3 | XM_002430217.1 | PHUM469580 | Corrected predicted sequence |
Rhodnius prolixus | supercontig GL563043:6194 … 6510 | RPTMP01442 | Corrected predicted sequence |
Rhodnius prolixus | supercontig GL555478:421390 … 596632 | No protein predicted | Our prediction |
Strigamia maritima | scf7180001248850:65900 … 68500 | Smar_temp_012614 | |
Tribolium castaneum 1 (eg) | NM_001114367.1 | NP_001107839.1 | |
Tribolium castaneum 2 (knrl) | NM_001128495.1 | NP_001121967.1 | |
Alignments and Phylogenetic Analysis
The high conservation of the core sequence, which contains the DNA-binding domain, makes it very easy to assign specific genes to the knirps family. Across all sequences, nearly 70% of the residues within the 94 aa core sequence are identical, and only about 10% are highly variable. Outside of this core sequence, the sequences are extremely variable, and precise alignments were more difficult.
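As an illustration of how per-column conservation figures of this kind can be derived from an alignment, the following minimal Python sketch counts invariant and highly variable columns. It is not the procedure used in this study: the function name `column_conservation`, the threshold of four distinct residues for "highly variable", and the toy alignment are all assumptions made for the example.

```python
def column_conservation(alignment, variable_threshold=4):
    """Return (fraction of invariant columns, fraction of highly variable columns).

    `alignment` is a list of equal-length aligned sequences ('-' marks a gap).
    The cut-off for "highly variable" is an assumed value, not one from the paper.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")

    invariant = 0
    variable = 0
    for column in zip(*alignment):
        residues = set(column) - {"-"}
        if len(residues) == 1 and "-" not in column:
            invariant += 1      # same residue in every sequence, no gaps
        elif len(residues) >= variable_threshold:
            variable += 1       # many different residues at this position
    return invariant / length, variable / length


# Hypothetical toy alignment, not real Knirps-family sequences.
toy_alignment = [
    "CKVCGEPAAG",
    "CKVCGEPSAG",
    "CRVCGEPTAG",
    "CKVCGEPQAG",
]
invariant_fraction, variable_fraction = column_conservation(toy_alignment)
print(invariant_fraction, variable_fraction)
```

In the toy alignment, eight of the ten columns are invariant and one column carries four different residues; the remaining column (K/R) falls into neither class.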
An alignment of the full amino acid translations shows additional smaller conserved motifs consisting of a handful of amino acids in all the sequences. These were located in each sequence using the motif search tool in the Geneious software package, allowing up to two mismatches. The most notable of the short motifs are what we have called the PIDLS-domain and the GASS-domain. The former appears twice towards the C-terminus in all gene copies (with the exception of D. melanogaster knirps, which has only one copy). The conserved aa sequence is P I/L/M D L S/T N K. The latter is encoded by a subset of nine genes. In seven cases it appears with the conserved sequence SDSGASSA, with up to one amino acid residue change, and in one case two amino acids are replaced. In one more case (Apis3), only five of eight amino acids are conserved, but as this is in a homologous position, we identify it as the same motif. Additional short stretches of sequence similarity were found among genes of closely related species (e.g., within the Diptera).
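A motif search that tolerates a fixed number of mismatches can be sketched in a few lines of Python. This is not the Geneious implementation, whose internals are not described here; it is a simple sliding-window scan, and the names `PIDLS`, `GASS`, `find_motif`, and the toy sequence are assumptions introduced for illustration. The degenerate positions follow the consensus given above (P I/L/M D L S/T N K and SDSGASSA).

```python
# Each motif position is the set of residues accepted at that position.
PIDLS = [set("P"), set("ILM"), set("D"), set("L"), set("ST"), set("N"), set("K")]
GASS = [set(residue) for residue in "SDSGASSA"]


def find_motif(seq, motif, max_mismatches=2):
    """Slide the motif along `seq`; return (start, window, mismatches) for each
    window whose mismatch count does not exceed `max_mismatches`."""
    hits = []
    width = len(motif)
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        mismatches = sum(res not in allowed for res, allowed in zip(window, motif))
        if mismatches <= max_mismatches:
            hits.append((start, window, mismatches))
    return hits


# Hypothetical toy sequence, not a real Knirps-family protein.
toy = "MAAASDSGASTAQQQPLDLSNKGGGPIDLTNK"

print(find_motif(toy, PIDLS, max_mismatches=0))  # two exact PIDLS-like windows
print(find_motif(toy, GASS, max_mismatches=1))   # one GASS-like window, one change
```

With short motifs, a generous mismatch allowance produces many chance hits, which is one reason to also require a homologous position, as done for the Apis3 case above.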
The results of the different phylogenetic analyses are presented in figure 2. While there are differences between the results obtained with different methods, there are several patterns that are consistent. Within the insects, there is a division of the knirps gene family into a number of distinct groups. A large assembly (marked in green in fig. 2) consistently comes out as a monophyletic group. This group includes the genes knirps-related (knrl) from D. melanogaster, and knirps from T. castaneum, and a number of genes also identified as knrl based on BLAST similarity. The group of genes marked in blue in the figure includes eg and genes identified as similar to it. These genes do not form a consistently supported group in all analyses, and several of the root branches of this group are not statistically supported at all. However, we note that all of these genes contain the GASS-domain, lending support to their inclusion into a monophyletic group. A third group (marked in magenta) includes sequences from hemipteroid insects (Pediculus humanus, Rhodnius prolixus, and Acyrthosiphon pisum) that do not have a GASS-domain. They tend to cluster together, but their position in the tree varies between analyses. D. melanogaster knirps and Apis mellifera 1 (marked in red in fig. 2) are highly divergent and come out at a different position in each analysis, usually with weak support. The genes from the non-insect outgroups (marked in black) do not cluster with the insect genes. The two copies from Dap. pulex and the two copies from P. hawaiensis cluster by species, and apparently do not represent orthologous pairs.

Fig. 2. Phylogenetic trees of the protein sequences generated by different methods. (A) Bayesian tree. Numbers on nodes indicate consensus support among 203 generated trees. (B) Maximum parsimony tree generated by the Mesquite software package using default parameters. Numbers on nodes indicate consensus support among 53 most parsimonious trees. (C) Maximum likelihood tree generated by Phylip Server with default parameters. Numbers on nodes represent aLRT support for most likely tree. Nodes with support values <0.5 have been collapsed. The putative groupings are marked by specific colors. Blue: eagle group genes. Green: knirps-related group genes. Magenta: hemipteroid-specific duplication. Black: non-insect genes. Red: highly divergent lineage-specific duplications. Abbreviations as in figure 1.
Intron–Exon Structure
After determining the splice sites for each of the sequences, we noted the number of exons and the position and length of the introns in each of the coding sequences (fig. 3). The different knirps relatives contain two to four protein-coding exons. One of these is universally conserved in all of the genes found (marked with a bounding box in fig. 1). This exon is 78 bases long, and in most cases is the first coding exon of the gene. In five of the genes it appears to be the second, and there is a short exon preceding it. One of these is D. melanogaster knrl, where the long intron located between the small first coding exon and the conserved second exon prevents the gene from being fully transcribed during the embryo’s early development (Rothe et al. 1992). It is interesting to point out that in three of the insect genes where there is an additional first exon (Apis1, Phum1, and Amel1), this exon contains conserved short sequences, suggesting a common origin for the 5′ exon in insects. The universally conserved exon begins with a methionine codon (ATG), even when it is not the first protein-encoding exon, and ends with an A, which is the first base of a lysine codon. The splice sites for this exon are always in the same position. Within the protein sequence encoded by this exon, only one amino acid position varies frequently between species. Four additional amino acid positions are encoded differently in one or two of the species.

Fig. 3. The intron–exon structure of the 28 sequences used in the analysis. The boxes indicate the protein-encoding exons and are drawn to scale. The introns are represented by angled lines, and the thickness of the lines is proportional to intron length (see bottom of figure). The core sequence is marked in blue in the exons. The conserved short motifs appear as colored bars in their correct position. Sequence logos for the three motifs appear by their legends. Poorly conserved motifs appear in a lighter color (light green for GASS-domain and orange for first PIDLS-domain).
The intron–exon boundaries downstream of the conserved exon are highly variable, and there are no conserved splice sites. Several splice site prediction algorithms and automated annotations predicted an exon–intron boundary at the end of the conserved core sequence. We believe that this is an erroneous prediction in all cases, as the sequence that follows (predicted to be intronic) can be matched with the coding sequence in other organisms.
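A quick first check on a suspect intron prediction of this kind is whether the supposedly intronic sequence continues the open reading frame of the preceding exon. The Python sketch below shows only this weaker test; it does not reproduce the comparison with coding sequences from other organisms described above, and a read-through frame is not proof that the region is coding. The standard genetic code is assumed, and the function names and toy sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, frame=0):
    """Translate `dna` from `frame`; unknown codons become 'X', a trailing
    partial codon is ignored."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def reads_through(exon, predicted_intron):
    """True if translating exon + predicted intron in the exon's frame meets
    no stop codon after the exon's last complete codon."""
    protein = translate(exon + predicted_intron)
    downstream = protein[len(exon) // 3:]
    return "*" not in downstream


# Hypothetical toy sequences: a 10-base exon that ends on the first base of a
# codon, followed by two candidate "introns".
exon = "ATGTGCGTGA"
open_candidate = "AGGTAAGTCCGGAC"   # continues the frame: K V S P D
stop_candidate = "AGGTAAGTTGAGAC"   # in-frame TGA stop codon

print(reads_through(exon, open_candidate))
print(reads_through(exon, stop_candidate))
```

A candidate that passes this test would still need to be aligned against known coding sequence from related species before the intron prediction is rejected.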
Manual versus Computerized Annotation of Large Datasets
For 17 of the 28 genes that were found, we adopted information from existing gene models. For the remaining 11 genes, we predicted intron–exon structures ourselves. This was partly because some of them came from genomes that had not yet been annotated at the time of data collection. However, in some cases we discovered that the computer-generated predictions were erroneous or partial and we corrected them manually. As noted above, in many cases, a presumably erroneous intron was predicted just after the end of the conserved core sequence. It is intriguing that the prediction algorithms failed so frequently in this group of genes, and specifically in one site. We suggest that there might be a conserved sequence that is similar to a splice site signal, but that upstream and downstream recognition sequences are missing, so that the site is not spliced in vivo. We also cannot rule out the possibility that this is indeed a site that is alternatively spliced.
In the past few years, there has been a dramatic increase in the number of sequenced genomes along with a large increase of raw sequence data, and this is expected to continue to rise. In our study, we found that the exclusive use of automatically annotated gene models was insufficient for precise gene locus analysis. We used manual curation to overcome erroneous predictions and to provide maximal reliability. Based on our experience, we conclude that a manual curation approach has obvious limitations with respect to time and work resources. However, supporting automated genome annotations with experimental or transcriptomic data may prove essential for more comprehensive studies.
The Evolution of the knirps Gene Family
The phylogenetic analysis, together with the distribution of the GASS-domain, strongly suggests that the knirps family in insects fundamentally includes two ancestral genes, eg and knrl. These are probably a result of a single gene duplication event that occurred within basal insects, or in the lineage leading up to insects. The sequences found in the chelicerate and myriapod represent the ancestral state of the family with only one gene present, prior to the insect-specific duplication. The two copies found in Dap. pulex and P. hawaiensis most likely represent independent, lineage-specific duplication events, a suggestion strengthened by the fact that the two Dap. pulex genes are found in tandem on the same scaffold. However, we cannot rule out the possibility of a single duplication event in crustaceans followed by concerted evolution of the two sequences, as demonstrated for the engrailed gene duplicates in hexapods (Peel et al. 2006).
One of the two copies in insects acquired the short sequence encoding the GASS-domain, and this remains the signature sequence of all of the orthologs of the gene we propose should be called eagle (blue in fig. 2). This motif has not been described previously and is not found in any of the short motif databases we searched (Pfam, BLOCKS, PROSITE). However, a BLAST search within the D. melanogaster genome recovered at least 100 hits in which the motif is conserved to within 75%, showing that it is not uncommon. An analysis of putative function of the genes where the motif is found, using the DAVID online platform for GO term analysis (Huang 2009a, 2009b), shows that approximately 33% of the top 50 hits that have GO annotations are zinc finger proteins, while a smaller number are annotated as transcription factors of various kinds. The appearance of multiple polar and charged amino acid residues in this sequence, typical for DNA-binding domains, hints at a possible accessory role in DNA binding.
The second copy in insects, which is equivalent to the genes known as knirps-related (knrl; green in fig. 2), seems to be more conserved and has more copy-specific conserved amino acid residues. Therefore, the monophyly of its orthologs is more consistently supported in different analyses. In contrast to eagle orthologs, knrl sometimes has an additional 5′ protein-coding exon, and it tends to have more introns than eagle. This gene may have undergone an additional duplication within the Paraneoptera (hemipteroids), as evidenced by the clustering of one copy of this gene in Ped. humanus, Acy. pisum, and R. prolixus (magenta group in fig. 2). Functional data on the Knrl group proteins versus Eg group proteins are insufficient to suggest a conserved functional difference between the two genes, nor are there sufficient data regarding the hemipteroid-specific duplicates.
Within the lineage leading to Drosophila, the knrl gene underwent a second duplication. One of the copies then diverged significantly from the parent sequence. This highly divergent copy is the gene that was first discovered and given the name knirps based on its mutant phenotype in the classic mutagenesis screen by Nüsslein-Volhard and Wieschaus (1980). Ironically, the gene that gave its name to the family is the most atypical. The most noticeable divergence is that it is missing the second PIDLS-motif. The two copies in Drosophila have undergone subfunctionalization, with one copy (kni) being more active in the early blastoderm. In this respect, it is interesting to note that the early blastodermal expression domain in the prospective abdomen is not only earlier due to the lack of a large intron (Rothe et al. 1992) but is also regulated independently of, and differently from, the respective expression domain of knrl (Rothe et al. 1994). Despite the sequence divergence of the fast-evolving gene kni, it is still functionally highly redundant with the more conserved gene (knrl), as the embryonic head/trachea/gut phenotype is only seen when both gene functions are removed (Gonzalez-Gaitan et al. 1994; Chen et al. 1998; Fuß et al. 2001). Similarly, mutation of a shared enhancer element is needed to knock down both genes during wing development (Lunde et al. 2003). Thus, kni and knrl in flies might represent a classical situation of a gene duplication event, in which one gene copy keeps the original function, whereas the second redundantly retains most of this function while diverging toward new functions (Ohno 1970).
The honeybee, Api. mellifera, also contains an extra copy of the knrl gene. This gene, labeled Amel1, has an inconsistent position in the tree, making it difficult to identify its orthology. This is largely due to the existence of an unusually large first exon. We suggest that in Api. mellifera there has been an event similar to that in D. melanogaster, where a lineage-specific duplication led to an additional highly divergent copy of the gene. To test the effects of the two divergent genes on the phylogenetic analysis, we carried out two additional tests. In one, we ran the Bayesian analysis with identical parameters, but without the two divergent sequences. This analysis resulted in a tree that is similar in topology, but resolves the knrl versus eg split better, suggesting that at least some of the anomalies were due to long-branch attraction to the two divergent sequences (supplementary file S5, Supplementary Material online). In the second, we excluded the first exon (present in five sequences, including Amel1) and the final PIDLS-motif (missing in Dmel-kni). This evidently removed too much informative data, and the resulting tree mostly collapsed to many polytomies (not shown). However, it is worth noting that Dmel-kni clustered with weak support with the group of knirps-related genes.
Figure 4 shows a suggested reconstruction of the evolutionary history of the knirps gene family, summarizing the discussion above. We suggest a major split into two groups within the insects, followed by lineage-specific duplications and losses. In the non-insect clades, we can identify two distinct duplication events within the crustaceans.

Fig. 4. Reconstruction of the evolutionary branching events and gene duplications that created the knirps family genes found in our analysis. Gene duplication events are marked with angled gray lines. Genes that we believe are not found due to secondary loss appear grayed out in the tree.
Our analysis of the evolution of the knirps family of genes presents a hypothesis of molecular evolution that is congruent with arthropod phylogeny. Similar analyses at a larger scale require significant improvements in automated gene predictions, for example, by taking advantage of transcriptome data, for non-model species. The rapid advances in sequencing technology need to be followed by a similar advancement in analytical tools, making such analyses possible for a wide range of gene families in different phyla.
Materials and Methods
Building a Dataset
In order to accumulate the initial dataset, we used several resources. In all methods, we used the core sequence of the Knirps protein in D. melanogaster as the query sequence (see fig. 1) and ran a tblastn query (Altschul et al. 1997). Initially we used NCBI’s genome BLAST against the available species of arthropods. These were fully annotated genomes at the time of the analysis and so the extraction of the genomic, mRNA, and protein sequences for each organism was straightforward. These searches were conducted for the following species: D. melanogaster, Api. mellifera, Nasonia vitripennis, Acy. pisum, and T. castaneum. It should be noted that there are additional Drosophila species with fully annotated genomes but these were not used, since the aim was to examine a broad phylogenetic spectrum.
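As a rough illustration of how hits from such a search could be screened, the sketch below parses BLAST tabular output (the standard 12-column layout produced by `-outfmt 6` in the command-line BLAST+ suite) and keeps hits below an E-value cutoff. The searches described above were run through NCBI's web interface, so this is not the authors' procedure; the cutoff, the function and class names, and the example hit lines are all assumptions made for illustration.

```python
from typing import NamedTuple

# Column order of BLAST+ tabular output (-outfmt 6, default fields).
FIELDS = ("qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore")


class Hit(NamedTuple):
    subject: str    # scaffold or contig that was hit
    pident: float   # percent identity of the aligned region
    sstart: int     # start of the hit on the subject (nucleotide coordinates)
    send: int       # end of the hit on the subject
    evalue: float


def parse_hits(tabular_text: str, max_evalue: float = 1e-10) -> list[Hit]:
    """Return hits with E-value <= max_evalue, lowest E-value first.

    The default cutoff is an arbitrary, hypothetical choice.
    """
    hits = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        row = dict(zip(FIELDS, line.split("\t")))
        hit = Hit(row["sseqid"], float(row["pident"]),
                  int(row["sstart"]), int(row["send"]), float(row["evalue"]))
        if hit.evalue <= max_evalue:
            hits.append(hit)
    return sorted(hits, key=lambda h: h.evalue)


# Invented example lines: scaffold names, coordinates and scores are made up.
example = (
    "kni_core\tscaffold_12\t85.1\t94\t14\t0\t1\t94\t5301\t5582\t3e-45\t160\n"
    "kni_core\tscaffold_77\t41.0\t60\t30\t2\t5\t64\t910\t1089\t0.002\t35.1\n"
)

for h in parse_hits(example):
    print(h.subject, h.sstart, h.send, h.evalue)
```

The subject coordinates of each retained hit mark the genomic region that would then be extracted for manual inspection of the conserved core.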
Additional sources of information were various genome project websites (table 1), which often offer complete assemblies or draft assemblies of that organism’s genome. These genomes had yet to be annotated and therefore were not included in NCBI’s databases at the time the first search was done (2009–2010), and prediction of mRNA and protein products had yet to be published. To predict the product of these genes, we extracted the sequence that includes the conserved area of the protein. These searches were conducted for the following organisms: Aedes aegypti, A. gambiae, Culex quinquefasciatus, I. scapularis, Ped. humanus, Bombyx mori, S. maritima, and R. prolixus. Sequences for Dap. pulex were taken from Thomson et al. (2009). Sequences for S. maritima were obtained as part of the manual annotation process for this species and are used by permission (Richards S, personal communication).
When more than one copy of the gene was found in any organism, the genes were numbered in the order they were discovered. The T. castaneum genes were annotated according to their D. melanogaster orthologs (eg and knrl) in their NCBI entries and so they are marked with these names as well. In some cases, a BLAST search of an annotated genome revealed additional copies of the gene, which were either not predicted or represented incorrect predictions, containing only a part of the conserved core. In these instances, the sequences used were the ones extracted from the unannotated genome and processed as the other predicted sequences. We found three sequences in the genome of the pea aphid Acy. pisum; however, two of these (Apisum1 and Apisum2) are nearly identical at the nucleotide level (with one copy missing one exon), and we suspect that there may be a sequencing or assembly error. Therefore, we only used Apisum1 and Apisum3 in further analyses.
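One simple way to flag near-identical copies of this kind is to compute nucleotide identity over the aligned positions that both copies share, ignoring gapped columns such as a missing exon. The following is a minimal sketch under that assumption; the function name and the toy sequences are invented, and no identity threshold from the study is implied.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity over columns where both aligned sequences have a base.

    `a` and `b` must be rows of the same alignment (equal length, '-' = gap).
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue  # skip gapped columns, e.g. an exon missing from one copy
        compared += 1
        matches += x == y
    return 100.0 * matches / compared if compared else 0.0


# Toy aligned fragments (invented); copy2 lacks a stretch present in copy1.
copy1 = "ATGGCCAAGTGCTTCAGG"
copy2 = "ATGGCCAAG------AGG"
copy3 = "ATGTCGAAATGTTTTCGC"

print(percent_identity(copy1, copy2))
print(percent_identity(copy1, copy3))
```

A pair that is identical across all shared columns but differs only by a large gap is the pattern that would raise suspicion of an assembly artifact rather than a true paralog.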
For the amphipod P. hawaiensis, knirps1 (kni1) and knirps2 (kni2) were isolated via degenerate polymerase chain reaction (PCR) followed by Rapid Amplification of cDNA Ends (RACE). To this end, polyA RNA was isolated from a pool of mixed-stage Parhyale embryos using the MicroPoly(A)Pure™ Kit (Ambion). First-strand and double-stranded cDNA were generated using the SMART RACE cDNA Amplification Kit and the SMART PCR cDNA Synthesis Kit (Takara Clontech), respectively. Novel Parhyale knirps sequences were recovered by degenerate PCR with primers based on aligned arthropod Kni, Knrl, and Eg protein sequences. Initial sequence information from degenerate PCR was used to perform 5′ and 3′ RACE PCR (SMART RACE cDNA Amplification Kit, Takara Clontech), using nested reverse and forward primers. The kni1 and kni2 cDNA sequences derived from 5′ and 3′ RACE fragments were verified via long-range PCR (Advantage® 2 PCR Enzyme System, Takara Clontech). See supplementary methods, Supplementary Material online, for primer details. The GenBank accession numbers are JQ841040 for Parhyale kni1 and JQ841041 for Parhyale kni2.
Splice Site Identification
For the genes obtained from unannotated genomes, it was necessary to identify splice sites in order to determine which areas of the gene are translated. Initially we attempted to combine results from several splice site prediction servers but recovered numerous errors and false positives (see supplementary methods, Supplementary Material online, for details). Thus, we ultimately decided to manually determine splice sites by considering the results of splice prediction algorithms together with exon–intron boundaries in homologous sites indicated in the genes from fully annotated genomes. We refer to these sequences with manually determined splice sites as predicted sequences (see table 2), since they have no experimental verification. The alignments of DNA to mRNA for all genes were carried out using EBI's Align (Needleman and Wunsch 1970) and NCBI's Spidey (Wheelan et al. 2001), with some manual adjustments. When using Align, we used the needle algorithm while setting the maximum penalty for gap opening and the minimum penalty for gap extension.
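Part of such manual curation can be expressed as a simple consistency check: most introns begin with GT and end with AG, so a proposed set of exon boundaries can be screened for introns that violate this rule. The sketch below illustrates only that check, assuming forward-strand, 0-based, end-exclusive exon coordinates; the function names and the toy sequence are hypothetical, and the procedure actually used also weighed prediction-server output and homology to annotated genes.

```python
def check_introns(genomic: str,
                  exons: list[tuple[int, int]]) -> list[tuple[int, int, bool]]:
    """Report, for each intron implied by consecutive exons, whether it starts
    with GT and ends with AG (the canonical splice-site dinucleotides).

    `exons` holds 0-based, end-exclusive (start, end) coordinates on the
    forward strand, sorted by position.
    """
    genomic = genomic.upper()
    report = []
    for (_, donor), (acceptor, _) in zip(exons, exons[1:]):
        intron = genomic[donor:acceptor]
        canonical = (len(intron) >= 4
                     and intron.startswith("GT")
                     and intron.endswith("AG"))
        report.append((donor, acceptor, canonical))
    return report


def splice(genomic: str, exons: list[tuple[int, int]]) -> str:
    """Concatenate the exon sequences to obtain the predicted transcript."""
    return "".join(genomic[start:end] for start, end in exons)


# Toy locus (invented): exons in upper case, a single intron in lower case.
toy = "ATGGCC" + "gtaagtttttcag" + "AAGTGA"
good_exons = [(0, 6), (19, 25)]
shifted_exons = [(0, 7), (19, 25)]   # donor boundary off by one base

print(check_introns(toy, good_exons))
print(check_introns(toy, shifted_exons))
print(splice(toy, good_exons))
```

A boundary that fails the check is not necessarily wrong, since non-canonical introns exist, but it is a natural candidate for closer comparison with the homologous exon–intron boundaries in annotated genomes.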
For some of the less well-annotated genomes, gene predictions have been done since we completed our own preliminary analysis. We compared our analysis with the predictions and noted discrepancies. In many cases, the predictions were missing exons that we uncovered manually, whereas in other cases, our manual work missed exons or exon–intron boundaries that were predicted in automated annotations (see table 2 for full list).
Using Parhyale kni1 and kni2 cDNA sequence (see above), BAC clones covering the genomic kni1 and kni2 loci were identified from a Parhyale BAC library (Parchem et al. 2010). Sequence determination of one BAC clone covering the kni1 locus (#249P10) and three BAC clones covering the kni2 locus to varying extent (#152L23, #213I10, and #218E14) was outsourced for 454 sequencing and assembly to the Göttingen Genomics Laboratory (G2L). By aligning cDNA and genomic sequences using Clustal Omega (Goujon et al. 2010) and manual curation, intron–exon boundaries were identified (for more details, see supplementary methods, Supplementary Material online). Accession numbers for genomic sequences are KC808167 (kni1) and KC808168 (kni2).
Alignments and Phylogenetic Analysis
Amino acid alignments were conducted using ClustalW2 as implemented in the Geneious software package (Biomatters, New Zealand) with default parameters. The alignment was then improved manually, and uninformative residues (long stretches with no homology within any pair of sequences) were removed. The final alignment included 628 characters for 28 sequences (see supplementary methods, Supplementary Material online).
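The removal of uninformative residues was done by hand, using the criterion of long stretches with no homology between any pair of sequences. A crude automated stand-in for that idea, shown here only as a sketch, is to drop alignment columns in which fewer than two sequences carry a residue, since such columns cannot support any pairwise comparison. The function name, the `min_residues` parameter, and the toy alignment are assumptions and do not reproduce the published alignment.

```python
def drop_unshared_columns(alignment: dict[str, str],
                          min_residues: int = 2) -> dict[str, str]:
    """Remove columns in which fewer than `min_residues` sequences have a
    residue (any character other than the gap symbol '-')."""
    rows = list(alignment.values())
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all aligned sequences must have the same length")
    keep = [i for i in range(length)
            if sum(row[i] != "-" for row in rows) >= min_residues]
    return {name: "".join(seq[i] for i in keep)
            for name, seq in alignment.items()}


# Toy protein alignment (invented sequences and names).
toy_alignment = {
    "seqA": "MKC--RLL",
    "seqB": "MKCQQ-LL",
    "seqC": "MRC---LL",
}

for name, seq in drop_unshared_columns(toy_alignment).items():
    print(name, seq)
```

Unlike the manual procedure, this rule acts on single columns rather than long stretches, so it would need a minimum run length or visual review before being applied to real data.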
The protein sequence alignments were then used to construct phylogenetic trees using several different phylogenetic tools. For Bayesian analysis, we used the MrBayes (Huelsenbeck and Ronquist 2001) plugin for the Geneious software. Model selection was done using the ProtTest server (Abascal et al. 2005). The analysis was run using the VT + invgamma model, with a chain length of 100,000 and a burn-in length of 10,000. Maximum likelihood analysis was carried out using the PhyML web server (Guindon and Gascuel 2003; Guindon et al. 2010), using the LG substitution model with discrete gamma and four categories. The resulting tree presents approximate likelihood ratio test (aLRT) support for the most likely tree (Anisimova and Gascuel 2006). Maximum parsimony analysis was carried out using the heuristic search algorithm in the Mesquite 2.7.2 software package. A consensus tree of the 53 most parsimonious trees was generated using the Geneious package.
Acknowledgments
The authors thank Ernst Wimmer and Liran Carmel for comments on the manuscript, and Nurit Doron for discussions on protein structure. The manuscript was improved by comments from three anonymous referees. The authors also thank Nipam Patel for providing the spotted Parhyale BAC library. Sequences from the Strigamia maritima genome are used by permission of the Baylor College of Medicine Genome Center and the Strigamia Sequencing Consortium. Work of B.G.M.S. and J.S. was supported by the European Community’s Marie Curie Research Training Network ZOONET under contract MRTN-CT-2004-005624 (to Ernst A. Wimmer) and the Boehringer Ingelheim Foundation (to Ernst A. Wimmer). Work of T.N.P. and A.D.C. was supported by an Israel Science Foundation grant #240/08.
References
Author notes
Associate editor: John True