Genomic Features of the Damselfly Calopteryx splendens Representing a Sister Clade to Most Insect Orders

Insects comprise the most diverse and successful animal group with over one million described species that are found in almost every terrestrial and limnic habitat, with many being used as important models in genetics, ecology, and evolutionary research. Genome sequencing projects have greatly expanded the sampling of species from many insect orders, but genomic resources for species of certain insect lineages have remained relatively limited to date. To address this paucity, we sequenced the genome of the banded demoiselle, Calopteryx splendens, a damselfly (Odonata: Zygoptera) belonging to Palaeoptera, the clade containing the first winged insects. The 1.6 Gbp C. splendens draft genome assembly is one of the largest insect genomes sequenced to date and encodes a predicted set of 22,523 protein-coding genes. Comparative genomic analyses with other sequenced insects identified a relatively small repertoire of C. splendens detoxification genes, which could explain its previously noted sensitivity to habitat pollution. Intriguingly, this repertoire includes a cytochrome P450 gene not previously described in any insect genome. The C. splendens immune gene repertoire appears relatively complete and features several genes encoding novel multi-domain peptidoglycan recognition proteins. Analysis of chemosensory genes revealed the presence of both gustatory and ionotropic receptors, as well as the insect odorant receptor coreceptor gene (OrCo) and at least four partner odorant receptors (ORs). This represents the oldest known instance of a complete OrCo/OR system in insects, and provides the molecular underpinning for odonate olfaction. The C. splendens genome improves the sampling of insect lineages that diverged before the radiation of Holometabola and offers new opportunities for molecular-level evolutionary, ecological, and behavioral studies.


A) Supplementary Text

Genome assembly and annotation
Contig assembly was performed using SparseAssembler (Ye et al. 2012), with parameters "K 51 g 20 GS 2000000", where K is the k-mer size, g is the number of skipped k-mers and GS is the estimated genome size in Kbp. Scaffolding was then performed on contigs >200 bp, using SSPACE (Boetzer et al. 2011) with default parameters. Different parameter combinations were tested at both the contig assembly and scaffolding steps, and the results of this "parameter scan" were evaluated using the BUSCO (Benchmarking Universal Single-Copy Orthologs) pipeline with the arthropod data set (Simao et al. 2015), in order to find the best possible assembly. The last step in genome assembly was to remove short scaffolds. Guided by BUSCO, we filtered out scaffolds <15 Kbp; removing larger scaffolds resulted in the loss of conserved arthropod genes. This resulted in an assembly consisting of 8,896 scaffolds, which is relatively easy to work with, while the assembly size (1.63 Gbp) is also very close to the estimated genome size (1.7 Gbp). However, less conserved and/or short genes may still be found in the remaining 430,431 short (<15 Kbp) scaffolds. Therefore, we also made them available on our website for the scientific community. Preliminary analysis of these short scaffolds showed that the vast majority do not have a significant hit against the SwissProt database. More specifically, 21,871 (5.1%) short scaffolds have a significant (e-value <1e-05) match against a SwissProt entry, excluding transposable elements. Furthermore, of those 21,871 scaffolds, only 1,369 (0.3% of all short scaffolds) cover >60% of the corresponding SwissProt entry.
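The length filter and the percentages quoted above can be illustrated with a minimal Python sketch. It is not the pipeline used for the assembly: the function names and the toy scaffold lengths are hypothetical, and only the 15 Kbp cutoff and the scaffold counts are taken from the text.

```python
def split_by_length(lengths, min_len=15_000):
    """Partition scaffold lengths (bp) into kept (>= min_len) and short (< min_len)."""
    kept = [n for n in lengths if n >= min_len]
    short = [n for n in lengths if n < min_len]
    return kept, short


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Toy scaffold lengths (hypothetical), split at the 15 Kbp cutoff from the text.
kept, short = split_by_length([250, 14_999, 15_000, 40_000, 1_200_000])
print(len(kept), len(short))  # 3 2

# Counts reported in the text for the 430,431 short scaffolds.
print(round(percent(21_871, 430_431), 1))  # 5.1 (significant SwissProt match)
print(round(percent(1_369, 430_431), 1))   # 0.3 (match covering >60% of the entry)
```

Note that the 0.3% figure is only reproduced when the 1,369 scaffolds are expressed relative to all 430,431 short scaffolds, not relative to the 21,871 scaffolds with a significant match.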
It should be noted that reads from long-insert libraries were used only for scaffolding and not for contig assembly. The reason is that long-insert libraries do not sample the entire genome equally well, and this unequal sequencing coverage could lead to assembly errors and/or fragmentation. In fact, the few attempts in which both short- and long-insert libraries were used for contig assembly resulted in assemblies with a much lower contig N50. Moreover, these runs required more computational resources and more runtime, such that it was not practical to perform the above-mentioned parameter scan.
Genes were annotated with the MAKER pipeline v. 2.31.8 (Campbell et al. 2014), which resulted in a gene set containing 22,523 genes. The evidence used comprised (a) Calopteryx splendens transcripts obtained from 1KITE, (b) arthropod proteomes from OrthoDB v8, and (c) the SwissProt protein database (Bairoch et al. 2004). Functional annotation was performed using InterProScan (Jones et al. 2014) for finding conserved domains and BLASTP (Camacho et al. 2009) against Uniref50 (Suzek et al. 2015) for finding conserved functions.
For assembling the above mentioned C. splendens transcriptome, one RNASeq library was prepared from whole body RNA extracts of pooled males and females. Sequencing resulted in 13,095,991 read pairs, or 3.9 Gbp of raw sequence data. Transcriptome assembly was done using SOAPdenovoTrans and generated 101,092 sequences (transcripts) comprising 39 Mbp.

Identification of contamination
We used two strategies to assess the occurrence of bacterial contamination in our libraries: scanning (a) the reads and (b) the predicted genes of the assembly for significant similarity to bacterial sequences. Such similarity would mean that the given damselfly sequence (read or gene) likely represents a bacterial contaminant. Alternatively, it could mean that these sequences were acquired by lateral gene transfer (LGT), especially since an increasing number of studies have documented such LGT events in various arthropods and other animals (Robinson et al. 2013).

Scanning reads
Reads from all eight libraries (four short-insert and four long-insert) were searched against the NCBI NT database using BLASTN. We found 549,172 reads (of a total ~2.6 billion reads) having a bacterial best match. Several attempts were made to assemble these reads but without success; we could only get very short contigs that very rarely contained whole genes. This finding suggested that there is very low contamination with bacterial reads. As a result, we decided to not remove any bacterial-like reads from our data set, especially since some of them could be the result of LGT.
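As a rough illustration of the read scan, the sketch below keeps the best hit per read from BLAST tabular output and counts the reads whose best hit is bacterial. It assumes the default 12-column tabular layout (bit score in the last column) and a hypothetical lookup table `subject_kingdom` that maps subject identifiers to a kingdom; how subjects were actually assigned to bacteria is not described in the text.

```python
import csv
import io


def best_hits(tabular_text):
    """Return {query_id: subject_id}, keeping the highest bit score row per query."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row:
            continue
        query, subject, bitscore = row[0], row[1], float(row[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def count_bacterial(best, subject_kingdom):
    """Count queries whose best-hit subject is labeled 'Bacteria' (hypothetical labels)."""
    return sum(1 for s in best.values() if subject_kingdom.get(s) == "Bacteria")


# Toy input: two reads, one with two candidate hits (all identifiers are made up).
rows = [
    ["read1", "subjA", "98.0", "100", "2", "0", "1", "100", "1", "100", "1e-40", "180"],
    ["read1", "subjB", "90.0", "100", "10", "0", "1", "100", "1", "100", "1e-30", "150"],
    ["read2", "subjB", "95.0", "100", "5", "0", "1", "100", "1", "100", "1e-35", "165"],
]
hits = best_hits("\n".join("\t".join(r) for r in rows))
print(count_bacterial(hits, {"subjA": "Bacteria", "subjB": "Metazoa"}))  # 1

# Fraction implied by the counts in the text (total read count is approximate).
print(f"{100 * 549_172 / 2.6e9:.3f}%")  # about 0.021%
```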

Scanning predicted genes
The amino acid sequences of all 22,523 predicted damselfly genes were searched against the NCBI NR database using BLASTP. There were only 50 genes in the C. splendens predicted gene set whose first BLASTP hit referred to bacteria. These sequences were further examined to determine whether they are more likely to represent bacterial contaminants or potentially laterally transferred bacterial genes. To this end, a number of features were extracted for each of these genes, which are shown in table S6. As can be seen in this table, a considerable number of genes show high similarity to Wolbachia bacteria. Wolbachia bacteria are common insect endosymbionts that have frequently transferred parts of their genome, or even their entire genome, to the nuclear genome of their arthropod host (Robinson et al. 2013). These genes are good LGT candidates and it would be interesting to verify their ancestry in future studies.
It should be noted that we also scanned our reads for possible contamination by sequences from gregarine parasites, which are known to infect Odonata (Cordoba-Aguilar and Cordero-Rivera 2005; Stoks and Cordoba-Aguilar 2012). First, we searched the reads against the NCBI NT database and found only 1,069 reads matching a gregarine parasite. Second, we searched the assembled scaffolds against the same nucleotide database (NCBI NT). No scaffold, however, had a best hit to a sequence originating from a gregarine parasite. The results from these two searches strongly suggest that gregarine contamination is negligible, both in the reads and in the assembled scaffolds.

Phylogenomics and orthology
The predicted gene set was mapped against all arthropods in OrthoDB v8 and assigned to ortholog groups (OGs). Subsequently, a phylogenomic analysis was undertaken using OGs that had exactly one ortholog (i.e. single-copy orthologs) in each of the following species: Daphnia pulex (water flea), Zootermopsis nevadensis (termite), Pediculus humanus (body louse), Acyrthosiphon pisum (pea aphid), Apis mellifera (honey bee), Tribolium castaneum (red flour beetle), Danaus plexippus (monarch butterfly) and Drosophila melanogaster (fruit fly). In addition, the BUSCO pipeline was used to extract single-copy orthologs from the transcriptome assemblies of the azure damselfly, Coenagrion puella (Johnston and Rolff 2013), and the blue-tailed damselfly, Ischnura elegans (Chauhan et al. 2014). For the latter, we assembled the transcriptome from the raw reads deposited in SRA, since there was no publicly available assembly. The assembly was done using Trinity (Haas et al. 2013) with the default parameters.
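The selection of single-copy ortholog groups described above can be sketched as follows. This is a minimal illustration under the assumption that OG membership is available as a mapping from an OG identifier to (species, gene) pairs; the data structure, the function name, and the toy identifiers are hypothetical and do not reflect the OrthoDB file format.

```python
from collections import Counter


def single_copy_groups(og_members, required_species):
    """Return the sorted OG ids that have exactly one gene in every required species.

    og_members: {og_id: [(species, gene_id), ...]} (hypothetical layout).
    Species outside required_species are ignored when applying the criterion.
    """
    selected = []
    for og_id, members in og_members.items():
        counts = Counter(species for species, _ in members)
        if all(counts.get(sp, 0) == 1 for sp in required_species):
            selected.append(og_id)
    return sorted(selected)


# Toy example with three of the species named in the text.
species = ["D. pulex", "Z. nevadensis", "D. melanogaster"]
groups = {
    "OG1": [("D. pulex", "a1"), ("Z. nevadensis", "b1"), ("D. melanogaster", "c1")],
    "OG2": [("D. pulex", "a2"), ("D. pulex", "a3"), ("Z. nevadensis", "b2"),
            ("D. melanogaster", "c2")],                      # duplicated in D. pulex
    "OG3": [("D. pulex", "a4"), ("D. melanogaster", "c3")],  # missing in the termite
}
print(single_copy_groups(groups, species))  # ['OG1']
```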

Protein families
The protein sequences of the C. splendens predicted genes were clustered with those of seven other insect species, using blastclust 2.2.9 from the BLAST+ package (Camacho et al. 2009), with a length coverage threshold >60% on both genes and a percentage identity threshold >35%. The seven insects used for this clustering scheme were: Z. nevadensis, P. humanus, A. pisum, A. mellifera, T. castaneum, D. plexippus and D. melanogaster. Families of interest with regard to this study, such as detoxification and immunity-related genes, chemoreceptors, and opsins, were subsequently studied in depth. Other families, such as the β-arrestin family (see the section on arrestins below), that were seemingly over- or underrepresented in C. splendens were also studied further. Finally, to identify over-represented InterPro entries, these were extracted from the InterProScan results and filtered to keep entries covering >75% of the corresponding Pfam hidden Markov model (HMM) (v3.1b1).
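To make the two clustering thresholds concrete, the sketch below links two proteins when a pairwise hit exceeds 35% identity and covers more than 60% of both sequences, and then reports the connected groups (single-linkage clustering). It is a simplified illustration, not blastclust's implementation; the input tuples and all names are hypothetical.

```python
def cluster(hits, lengths, min_cov=0.60, min_ident=35.0):
    """Single-linkage clustering of sequences from pairwise hits.

    hits: iterable of (seq_a, seq_b, percent_identity, aligned_len_a, aligned_len_b).
    lengths: {seq_id: sequence length}. Both layouts are assumptions of this sketch.
    """
    parent = {name: name for name in lengths}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, ident, aln_a, aln_b in hits:
        cov_a = aln_a / lengths[a]
        cov_b = aln_b / lengths[b]
        if ident > min_ident and cov_a > min_cov and cov_b > min_cov:
            parent[find(a)] = find(b)  # merge the two clusters

    groups = {}
    for name in lengths:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(g) for g in groups.values())


lengths = {"a": 100, "b": 100, "c": 100, "d": 100}
hits = [
    ("a", "b", 40.0, 70, 80),    # passes both thresholds
    ("b", "c", 50.0, 90, 61),    # passes both thresholds
    ("c", "d", 90.0, 50, 100),   # fails: only 50% of c is covered
]
print(cluster(hits, lengths))  # [['a', 'b', 'c'], ['d']]
```

Because linkage is transitive, "a" and "c" end up in the same family even though they share no direct hit, which is why such families can contain quite divergent members.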

Detoxification enzymes
Sequences encoding CYPs, GSTs, and CCEs were identified by searching for the corresponding InterPro domains (CYPs: IPR001128; GSTs: IPR003081, IPR005442, IPR003080, IPR003082; CCEs: IPR019819, IPR002018, IPR019826) in the InterProScan result file of the predicted protein set, as well as by BLAST (Camacho et al. 2009) searches using already known proteins for each superfamily from other insect species. A first manual analysis of the retrieved predicted peptides was performed with BLASTP searches against the NCBI NR and Uniprot/SwissProt protein databases. The predicted genes were then visualized and manually edited, when necessary, in the genome browser WebApollo (Lee et al. 2013). In some cases, when short sequences with partial domains were found, we attempted to retrieve more complete sequences using GeneWise (Birney et al. 2004). Manually curated sequences with at least 300 amino acids for CYPs/CCEs and 170 amino acids for GSTs were used for subsequent phylogenetic analyses. These length cutoffs were chosen based on the average length of proteins in each protein family. Predicted proteins were aligned with MAFFT (Katoh and Standley 2013), with the default parameters, to the corresponding proteins from D. melanogaster. For the CYP phylogeny, additional CYPs from Paracyclopina nana (Copepoda, Cyclopoida) (Han et al. 2015) were added to the analysis in order to better resolve nodes. A maximum-likelihood phylogeny was created for each superfamily using RAxML (Stamatakis 2006), using the PROTGAMMAAUTO model and performing 100 bootstrap replicates. The trees were visualized and drawn with Evolview (He et al. 2016) and Inkscape v0.91. SignalP (Petersen et al. 2011) and TMHMM (Krogh et al. 2001) were used to infer subcellular localization and the presence of transmembrane helices.
C. splendens orthologs of all immune signaling pathway members were identified, except for the death domain-containing gene tube, which in D. melanogaster interacts with pelle and Myd88 (Towb et al. 2009). Representatives of all gene families were also identified, and these genes were checked for the proportions of the InterPro domain profiles that they matched to reach a more conservative count of genes with domains that match more than 75% of the corresponding profile. Pfam profiles were used unless the domain had no corresponding Pfam profile, in which case Superfamily (LYSs & HPXs), SMART (PGRPs), and PANTHER (SODs) were used. In general, C. splendens immune-related gene families are not especially larger or smaller than those of other insects, although families of CASPs and PGRPs appear to have expanded. Only one GNBP and one DEF gene were identified, but their presence suggests that C. splendens is capable of GNBP and DEF-mediated immune responses even if some other arthropods have more of these types of genes, particularly of GNBPs. The immune gene catalogue is therefore generally complete, suggesting that C. splendens is capable of mounting robust immune responses to a variety of different pathogens and parasites.
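The conservative gene count described above (domains matching more than 75% of the corresponding profile) can be sketched as below. The input tuples are a hypothetical simplification of domain-search output, in which each match records the start and end positions on the profile and the profile length; the gene identifiers and profile length are invented for illustration.

```python
def profile_coverage(hmm_start, hmm_end, hmm_length):
    """Fraction of the profile spanned by a match (1-based, inclusive coordinates)."""
    return (hmm_end - hmm_start + 1) / hmm_length


def conservative_gene_count(domain_hits, min_coverage=0.75):
    """Count genes with at least one domain match covering > min_coverage of its profile.

    domain_hits: iterable of (gene_id, hmm_start, hmm_end, hmm_length).
    """
    genes = set()
    for gene_id, hmm_start, hmm_end, hmm_length in domain_hits:
        if profile_coverage(hmm_start, hmm_end, hmm_length) > min_coverage:
            genes.add(gene_id)
    return len(genes)


# Hypothetical matches against a profile of length 120.
hits = [
    ("gene1", 1, 100, 120),    # 83% of the profile: counted
    ("gene1", 30, 60, 120),    # second, partial domain on the same gene
    ("gene2", 40, 100, 120),   # 51% of the profile: not counted
    ("gene3", 10, 110, 120),   # 84% of the profile: counted
]
print(conservative_gene_count(hits))  # 2
```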
To examine the PGRPs in detail, the regions corresponding only to the matched PGRP domains from each protein were first extracted. Initial phylogenetic analyses with RAxML on MAFFT amino acid sequence alignments clearly distinguished between 'shared' (more closely related to those from other insects) and 'specific' (only found in C. splendens) domains. Thus, to estimate the PGRP-domain maximum likelihood phylogeny with domains from C. splendens, D. melanogaster, and A. mellifera, MAFFT alignments were first built for the shared and specific sets separately, and these alignments were then combined with MAFFT's merge function. MAFFT alignments were performed with default parameters and RAxML phylogenies were built with 100 bootstrap replicates. To confirm that such multi-domain PGRPs had indeed not previously been found in any other known species, a comprehensive search of available protein sequences was performed. The details of the twelve resulting matches, along with comments on the reliability of their annotation, are presented in table S8. Several matches were to proteins that have already been withdrawn (status inactive) and replaced with corrected annotations that no longer have multiple PGRP domains. For example, several fly proteins, each with three domains, are in fact incorrect gene annotations of PGRP-LC that have combined the three domains that in D. melanogaster are mutually exclusive alternative transcripts, and a 4-domain monkey protein results from an incorrect gene annotation that fuses the neighboring PGLYRP3 and PGLYRP4. This comprehensive search revealed that no other organism with available protein sequences has genes annotated with more than four PGRP domains encoded by a single gene.

Chemoreceptors
Although odonates had long been considered to be largely anosmic, relying primarily on visual and tactile stimuli for feeding and mating (e.g. Corbet 1980; Crespo 2011), recent studies have revealed diverse chemosensory capabilities (Rebora et al. 2012, 2013, 2014; Piersanti et al. 2014a,b, 2016; Frati et al. 2015, 2016). Three large families of chemoreceptors mediate most of the specificity and sensitivity of olfaction and taste in insects: the Odorant Receptor and Gustatory Receptor families of seven-transmembrane ligand-gated ion channels (Benton 2015; Joseph and Carlson 2016), which are distantly related to each other in the insect chemoreceptor superfamily now known to be present even in basal animals (Robertson et al. 2003; Saina et al. 2015; Robertson 2015), and the unrelated three-transmembrane Ionotropic Receptors, which are variants of the ionotropic glutamate receptors also widespread in animals (Rytz et al. 2013). With the exception of two kinds of coreceptors (OrCo in the ORs and Ir8a, 25a, and 76b in the IRs) and a few other GRs and IRs, most of these receptors evolve rapidly and are highly divergent both across orders of insects and across each receptor family. Gene models for them are therefore usually not well built by genome-wide automated gene modeling, unless supported by deep transcriptome sequencing of relevant chemosensory and other tissues. Indeed, the automated annotation for C. splendens has only partial models for OrCo, the single sugar receptor GR1, the three IR co-receptors, and four more IRs.
We therefore undertook an exhaustive manual annotation of these three chemoreceptor families. TBLASTN searches with e-values up to 100,000 (Altschul et al. 1997;Camacho et al. 2009) with query proteins from the termite Z. nevadensis, a non-holometabolan insect with a complete manually annotated set of chemoreceptors (Terrapon et al. 2014), and many receptors from other insects, were performed on both the main genome assembly and the excluded short scaffold set. Gene models were built manually in the text editor TextWrangler, which accommodates up to 18kb of sequence on a single line, which is important because genes in large genomes commonly have long introns. Splice predictions were obtained from the Splice Prediction by Neural Network (Reese et al. 1997) webserver at the Berkeley Drosophila Genome Project (http://www.fruitfly.org/seq_tools/splice.html), although it does not recognize variant GC donor sites, five of which were invoked to build suitable models. Relevant gaps in the assembly were repaired using raw reads from the four 550bp shotgun libraries available in the Short Read Archive at NCBI, and many of them simply required collapsing unresolved flanking direct repeats (negative gaps). Occasionally models were created that spanned two scaffolds, based on the appropriateness of connecting them across scaffolds. A few pseudogenes were included in the gene set, when they could be built to encode more than 50% of the family length without disrupting the alignments too badly, and were translated as best possible to encode an alignable protein (using Z for stop codons and X for all other pseudogenizing mutation such as frameshifting indels and mutated intron splice junctions). Iterative TBLASTN searches were performed with all newly identified chemoreceptors in an attempt to exhaustively find all members of each family. Gene models were refined in light of repeated multiple alignments of each family.
The final multiple alignments for each family, along with representative receptors from Z. nevadensis (removing all pseudogenes, most partial proteins, and some closely related ones), as well as D. melanogaster and other insects when relevant, were performed with CLUSTALX v2.1 (Larkin et al. 2007), which is particularly good at aligning these highly divergent proteins through alignment of their transmembrane domains. Poorly aligning and gapped regions were removed using Trimal v1.

The Gustatory Receptor (GR) family
There are 51 GR genes in the C. splendens genome, coding for 115 proteins. It is hard to be confident that all of the divergent GRs in a genome have been identified, especially since there are no GRs from closely related insect species that could greatly increase the sensitivity of our searches. Moreover, the relatively well-conserved C-terminal region is split across three short exons, which makes it difficult to find divergent members of the family using TBLASTN searches. PSI-BLASTP searches of the automatically annotated proteins from a genome will sometimes reveal divergent GRs not identified by TBLASTN searches, however in this case only one of the identified GRs is partially annotated, so this is not a useful approach. We attempted to perform exhaustive TBLASTN searches using as query the protein sequences encoded by the last two exons of each identified GR with LQ before them (to represent the last six positions of a consensus splice acceptor) and VS after the penultimate one (to represent the first six positions of a consensus splice donor), with E=1,000,000, but found no additional candidate GRs. TBLASTN searches with E=100,000 with the 58 GRs from D. pulex also failed to identify any additional GRs. We therefore believe we have identified most, if not all, of the GRs in this genome, although many fragments, most recognizable as pseudogenic, related to the named genes remain and some might represent intact genes.
For the phylogenetic analysis we first tested inclusion of the GRs reported by Missbach et al. (2014) from their transcriptome analysis; however, most are too short or align too poorly to be usefully included. We therefore included only the putative full-length GRs for the bristletail and firebrat. Representatives of the sugar receptor clade from D. melanogaster and Z. nevadensis, the carbon dioxide receptor clade from Anopheles gambiae and Z. nevadensis, and representative members of the fructose receptor clade from diverse insects were included. A large clade of intronless GRs from Z. nevadensis was excluded as C. splendens has no close relatives of them. The tree was rooted with the sugar and carbon dioxide subfamilies because these are the most distinctive and conserved subfamilies in the insect GR family (Figure S7).
There is a single relative of the conserved subfamily of sugar receptors of insects, named CsplGr1, and it is the only GR for which there is at least a partial model in the automated gene set (CSPLE_14204). The D. melanogaster sugar receptors appear to function as dimers (Fujii et al. 2015), and all other examined insect genomes encode at least two candidate sugar receptors, e.g. AmelGr1/2 in the honey bee (Robertson & Wanner 2006) and ZnevGr1-6 in Z. nevadensis (Terrapon et al. 2014), so it is unclear how CsplGr1 in C. splendens might function as a sugar receptor. Freeman et al. (2014) reported that single sugar GRs expressed in an "empty" olfactory sensory neuron mediated appropriate responses to sugar, while Jung et al. (2015) found that AmelGr1 alone is responsive to sugars, but the dimer is more sensitive, so a single sugar GR could suffice. Missbach et al. (2014) found a member of this subfamily in the bristletail (LsigGr2) showing that the sugar receptor subfamily is at least that old, and the crustacean D. pulex also has members of it (Penalva-Arana et al. 2009), so it predates insect evolution.
CsplGr2 is similarly a single relative of the expanded subfamily of GRs related to the three carbon dioxide receptors of flies and some other Holometabola that also function as dimers, but this subfamily is expanded in the bedbug Cimex lectularius (Benoit et al. 2016) and in Z. nevadensis (Terrapon et al. 2014), and it is not clear what ligands they recognize. Therefore CsplGr2 cannot be designated as a carbon dioxide receptor, although we note that odonates are capable of detecting carbon dioxide, involving inhibition of olfactory sensory neurons in coeloconic sensilla; however, this response might be mediated by ionotropic receptors (see below). Nevertheless, CsplGr2 clearly represents the GR lineage from which the sensitive carbon dioxide receptors of Holometabola evolved, and indicates the antiquity of this lineage in insects. Missbach et al. (2014) did not find a clearcut member of this subfamily in their three insects, but since they worked from transcriptomes, albeit deep ones, it is always possible that members of this subfamily reside in zygentomans and/or archaeognathans. The subfamily has not been detected outside of insects.
CsplGr3-6 are highly divergent proteins that were discovered as weak matches in searches with the DmGr43a protein and its relatives in other insects. This protein functions as a fructose receptor both in the sensory periphery and the brain of Drosophila (Miyamoto and Amrein 2014). It has multiple relatives in various other insects, but relatives were not identified in the termite (Terrapon et al. 2014). In the tree, along with TdomGr4, this clade clusters near, but not confidently with, the fructose receptor clade, so it remains uncertain whether they and TdomGr4 represent the origins of this fructose receptor subfamily in basal insects (see also Figure 4 in Missbach et al. 2014). This subfamily has also not been detected outside of insects.
The remaining 108 GRs form an expanded and species-specific clade confidently related to the intron-containing divergent GRs of the termite (Figure S7). They have features common to most GRs in other insects and indeed other arthropods (e.g. Robertson et al. 2003; Robertson 2015). First, their genes have the ancestral structure of a long first exon that encodes transmembrane domains 1-6 followed by three short exons separated by three phase-0 introns in locations shared across most GRs, encoding intracellular loop 3, transmembrane domain 7, and the extracellular C-terminus. There are only a few exceptions to this structure, in that some genes have idiosyncratically acquired novel introns that interrupt this usually long first exon (Gr6 has a phase 2 intron, Gr10 a phase 0 intron, Gr11/12 independent phase 1 introns, Gr17 a phase 2 intron, and Gr29 a phase 0 intron). Importantly, none of these genes are parts of the alternatively-spliced genes described below, where such an intron interrupting the first long exon would not be compatible with the alternatively-spliced model. Second, the transmembrane 7 domain includes the only reasonably well-conserved and signature region of the GRs (the 7tm_7 family in Conserved Protein searches at NCBI, which they all find), the TYhhhhhQF motif (where h is any hydrophobic amino acid), although commonly it is THhhhhhQF with a few more unusual modifications, e.g. ANhhhhhQF in Gr7. Third, many of them exhibit an unusual form of alternative splicing in which multiple first exons are spliced to a set of these final three exons, sometimes generating large numbers of GRs that share their C-termini but differ in transmembrane domains 1-6, implying that these mediate the specificity of ligand-binding (Gr4a/b is similarly modeled as being alternatively-spliced). While we have no RNAseq data to support these alternatively-spliced models, they are so similar to ones highly supported in some other insects that they are clearcut.
However, the fragmented nature of the genome assembly, perhaps sometimes caused by the similarity of many of these long first exons, resulted in several scaffolds containing only a few first exons. In each case, however, these could be confidently assigned to one of the alternatively-spliced loci. The largest of the alternatively-spliced loci is Gr51a-u, with 21 first exons (and many pseudogenic fragments, see below). Fourth, the five large alternatively-spliced loci (Gr47-51) largely form separate clusters in the tree, as expected since they presumably result from tandemly-arranged expansions of the first exon through unequal crossing-over, and share their C-terminal regions. Fifth, these alternatively-spliced loci contain several pseudogenic first exons, consistent with rapid ecologically-relevant evolution of chemoreceptors. Only those encoding at least 50% of a typical GR were included, and many fragmentary pseudogenic remnants are present in the five largest alternatively-spliced loci, and a few elsewhere in the genome. Fifteen such pseudogenic constructs were included in the named protein set, so of the 115 named GR proteins the number that are potentially functional is 100. It is worth noting that none of these pseudogenes had only a single stop codon; indeed most had multiple pseudogenizing mutations including frameshifts and intron boundary mutations, hence they could not be pseudo-pseudogenes (Prieto-Godino et al. 2016). While the functions of these GRs are unknown, they share the features of bitter taste receptors in other insects, and are presumably involved in mediating much of gustation in the contexts of feeding and oviposition in this damselfly.
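The signature motif of transmembrane domain 7 described above (TYhhhhhQF, commonly THhhhhhQF) lends itself to a simple pattern search. The sketch below is illustrative only: the set of residues treated as hydrophobic is an assumption of this example, since the text does not define it, and the rarer variants such as ANhhhhhQF are deliberately not matched. The test sequences are invented.

```python
import re

# Assumed hydrophobic residue set for the 'h' positions (not specified in the text).
HYDROPHOBIC = "AVILMFWC"

# T, then Y or H, then five hydrophobic residues, then QF.
TM7_MOTIF = re.compile(rf"T[YH][{HYDROPHOBIC}]{{5}}QF")


def find_tm7_motif(protein):
    """Return (0-based start, matched text) for each non-overlapping motif occurrence."""
    return [(m.start(), m.group()) for m in TM7_MOTIF.finditer(protein.upper())]


print(find_tm7_motif("GGSTYLLVAIQFKK"))  # [(3, 'TYLLVAIQF')]
print(find_tm7_motif("MKTHIVFMLQFSS"))   # [(2, 'THIVFMLQF')]
print(find_tm7_motif("GGSTYLLKAIQFKK"))  # [] (K is not in the assumed hydrophobic set)
```

A search like this can only flag candidates with a canonical motif; divergent family members still need the alignment-based inspection described in the text.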

The Odorant Receptor (OR) family
The OR family in most insects consists of a single highly conserved protein that serves as a co-receptor with all the remaining ORs, known as the Odorant receptor Co-receptor or OrCo. The remaining ORs are known as "specific" ORs as they confer the specificity of ligand-binding to the dimer. OrCo was partially modeled as CSPLE_00492; however, no specific ORs were found in the automated models. Searches of the assembly with ORs from Z. nevadensis uncovered a single specific OR in the assembly, called CsplOr1, encoded by a 6-exon gene.
Searches of the short scaffolds excluded from the main assembly revealed at least three more genes split across multiple short scaffolds, which were successfully connected to yield three 6-exon genes (CsplOr2-4) with evidence that the latter two are in a tandem arrangement. CsplOr2-4 share 42-49% amino acid identity and have 25-27% identity with CsplOr1. There are multiple short scaffolds with sequences similar to parts of these ORs and examination of the read depth for these three genes indicates that they probably represent 2-3 closely-related genes each. Finally, a single severely degraded pseudogene distantly related to these was noted, but could not be built sufficiently to include in the analysis. These proteins are unequivocally members of the OR family because their gene structure is similar to other ORs, especially in sharing three phase-0 introns in the same locations near the C-terminus (see above for GRs too), and all recover the 7tm_6 family in the NCBI Conserved Protein search, which is the OR family in insects. They are nevertheless highly divergent, sharing just 25-27% identity over their C-terminal two thirds with their closest OR relatives in the GenBank non-redundant protein database, while an iteration of PSI-BLASTP searches yields full-length alignments with ~20% identity. Thus C. splendens has at least four specific ORs. This low number of ORs is consistent with the known reduced olfactory capacities of odonates, but it remains unclear why they do not appear to have glomerular antennal lobes and mushroom body calyces usually involved in transmission of olfactory signals, structures present in the older firebrat (Farris 2005). Robertson et al. (2003) speculated on the basis of a tree of the insect chemoreceptor superfamily of ORs and GRs from D. melanogaster that the OR family might have evolved from a lineage of GRs near the base of the Insecta, perhaps in conjunction with the evolution of terrestriality. Missbach et al. (2014), however, could not identify OrCo or specific ORs in extensive transcriptomes of a wingless archaeognathan, the bristletail Lepismachilis y-signata, and discovered three OrCo-like proteins, but no specific ORs, in another wingless insect, the firebrat Thermobia domestica (Zygentoma), which most insect phylogenies indicate is a slightly more recent branch in the tree. They concluded that OrCo, at least, had evolved within insects, with specific ORs evolving after these wingless orders, perhaps by the Palaeoptera. Our finding of both a single OrCo and at least four specific ORs in this odonate indicates that the complete OrCo/OR system had indeed evolved by the time of the Palaeoptera.
To examine the relationships of these ORs further, our phylogenetic analysis included the three T. domestica OrCo (TdomOrCo) proteins, a representative set of OrCo proteins from other insects, three ORs from a phasmatodid Phyllium siccifolium also generated by Missbach et al. (2014), and a representative subset of the 69 ORs in Z. nevadensis (Terrapon et al. 2014). The OrCo proteins were declared the outgroup to root the analysis based on the intermediate position of this protein at the base of the OR family and nearer the GRs in analysis of the insect chemoreceptor superfamily (e.g. Robertson et al. 2003). The resultant tree shows the confident and phylogenetically appropriate clustering of the C. splendens OrCo, while the four specific ORs form a distinct and basal lineage relative to the termite and phasmatodid ORs (Figure S8), consistent with them representing early, specific ORs. It remains possible, however, that one or two of the TdomOrCo-like proteins, for example TdomOr1 and 3, in fact have evolved the role of a specific OR (Missbach et al. 2014).
Three relatives of the commonly-expanded Ir75 clade were also found, and this clade in Drosophila responds to various acids and amines (Silbering et al. 2011;Prieto-Godino et al. 2016). No convincing relatives were found for the pair of Ir41a/76a in D. melanogaster that are also commonly expanded in other insects. No clear orthologs for any other of the 60 D. melanogaster IRs were discovered, and specifically no relative of Ir64a, which is implicated in perception of high concentrations of carbon dioxide (Ai et al. 2010). Many insects and other arthropods examined from full genome sequences also have highly divergent IRs in two distinct sets, those with a set of introns roughly comparable to those of the above IRs, and an "intronless" set (a few of which have idiosyncratic newly-acquired introns). Following an approach used for Z. nevadensis (Terrapon et al. 2014) and other arthropods (e.g. Hoy et al. 2016) these were numbered from Ir101, avoiding any confusion of possible orthology with the D. melanogaster IRs, which were named for their cytological locations and only go up to Ir100a. These intron-containing genes are particularly hard to model as they are so divergent, and just five are included here (Ir101-105), although there are fragments of more of them, plus a set of fragmentary pseudogenes that might be remnants of a once-expanded clade. Only five "intronless" genes were found (Ir106-110), although Ir110 has acquired two novel introns.
The IR phylogenetic analysis included representative IRs from Z. nevadensis, a few from D. melanogaster, and several from each of the three insects in Missbach et al. (2014), including several of their partial sequences because they usually include the more conserved C-terminal regions, facilitating alignment and phylogenetic analysis. It was rooted with the Ir8a and 25a proteins, which, in larger analyses including the ionotropic glutamate receptors from which the IRs evolved, clearly cluster with them (Terrapon et al. 2014; Missbach et al. 2014) (Figure S10). It reveals the orthology of Ir8a, 21a, 25a, 40a, 68a, and 93a, as well as the Ir75a-c set. Inclusion of the IRs from Missbach et al. (2014) reveals that the Ir75 clade is even older than the palaeopterans because their LsigIr9 belongs in it. Interestingly, among the divergent IRs the intron-containing Ir105 is closely related to the intronless Ir106/107, indicating a recent loss of five introns in the Ir106/107 lineage after divergence from a common ancestor, presumably through recombination with a cDNA copy. In contrast, Ir108-110 cluster confidently in the clade of intronless termite IRs, indicating a far more ancient loss of their introns (see Terrapon et al. 2014).
Ionotropic receptors have been implicated in both olfaction and gustation in D. melanogaster (Rytz et al. 2013; Koh et al. 2014; Stewart et al. 2015), and some are even involved in detection of other stimuli like temperature and humidity (Ni et al. 2016; Enjin et al. 2016; Knecht et al. 2016). It is remarkable that, in addition to the three conserved co-receptors, Ir21a, 40a, 68a, 93a, and the Ir75 clade are present in this palaeopteran, indicating that they are at least this old in the insect lineage, with the Ir75 clade being even older. It remains unclear what role the divergent Ir101-110 play in odonate chemosensation, as they have only distant relationships with either the "antennal" or "divergent" IRs recognized in Drosophila, which generally are involved in olfaction and gustation, respectively (Rytz et al. 2013; Koh et al. 2014; Stewart et al. 2015).

Conclusion
We describe a complete set of OrCo/OR proteins for a palaeopteran insect, showing that this central system of insect olfaction had evolved by then. It remains to be seen from genome sequences of the other major lineage of palaeopterans, the mayflies, whether they too have this set, and genome sequences for zygentomans and archaeognathans are also required before one can confidently conclude that they do not have the complete system (Missbach et al. 2014). We note that, from a deep transcriptome of the damselfly Ischnura elegans generated from head, thorax, and abdomen, Chauhan et al. (2014) claimed to have found a single "Odorant receptor 2t1-like"; however, they do not provide a sequence for it, and the "Most Similar Locus" they note (LOC101663266) was from a small Madagascar hedgehog and has been removed from GenBank.
The GR family that mediates much of insect gustation is well represented in this damselfly genome, with a candidate sugar receptor, a receptor related to the carbon dioxide receptors of holometabolous insects, and 98 other apparently functional GRs, including five that might be related to the fructose receptor of Drosophila and other insects. The remaining GRs are commonly encoded by large alternatively spliced loci of the kind found in many other insects. Chauhan et al. (2014) reported seven GRs in their transcriptome, with similarities to Gr2a and 43a of D. melanogaster and other GRs in aphids and beetles; however, we find no such convincing relationships to these receptors, and again their sequences do not appear to be publicly available. These are presumably fragments of GRs that have best matches to these other insect GRs, and are probably in fact related to the divergent GRs we describe.
The IR family has several highly conserved members, and in addition to the three co-receptors Ir8a, 25a, and 76b found widely in insects and beyond, we report Ir93a, which is a hygroreceptor also found beyond insects, and Ir21a, 40a, 68a, and a 75a-related clade for the first time in a palaeopteran. Ir21a and 40a are involved in thermoperception, whereas the combination of Ir8a and Ir75a-like proteins is likely to mediate olfactory perception of acids and amines. Some of the divergent Ir101-110 might also be involved in olfaction, although IRs are also involved in gustation.
We therefore provide evidence for the likely molecular underpinnings of the emerging understanding that odonates are capable of olfaction in several ecologically relevant contexts (Rebora et al. 2012;Piersanti et al. 2014a;Piersanti et al. 2014b;Piersanti et al. 2016;Frati et al. 2015;Frati et al. 2016), and surely also gustation in both the context of feeding (Rebora et al. 2014) and oviposition (Rebora et al. 2013), and that these sensory modalities are important to their biology.

Odorant binding proteins (OBPs)
OBPs are small proteins expressed by support cells at the base of chemosensory sensilla and secreted into the sensillar lymph, where they are believed to bind and transport odorants from the atmosphere to chemoreceptors in the membranes of the dendrites of chemosensory neurons (Pelosi et al. 2006; Pelosi et al. 2014). They usually have six conserved cysteines (Classic OBPs) that form three disulphide bonds maintaining their globular shape with a binding pocket in the extracellular environment, although some have lost two of these cysteines and one of the disulphide bonds (Minus-C OBPs), while others have gained two more cysteines that are presumed to form a novel disulphide bond (Plus-C OBPs). While most are expressed in antennae, some are more widely expressed. Genome sequences and antennal transcriptomes have revealed tens of OBPs in most examined insects, and the gene family is present back to basal Hexapoda such as Collembola. Unfortunately, most OBPs are rapidly evolving and highly divergent proteins, and are commonly encoded by 5-8 short exons, so they are difficult to find using TBLASTN searches of genome sequences. Fortunately, they are highly expressed, so most are recovered as full-length transcripts in antennal transcriptomes, and even whole-body transcriptomes commonly contain at least partial transcripts, which are far easier to detect in TBLASTN searches because of their contiguity. We nevertheless first searched the genome using TBLASTN with an E-value of 100,000 with all available OBPs from the most closely related available insects: the neopteran termite Z. nevadensis with 29 (Terrapon et al. 2014) and cockroach Blattella germanica with 48 (Niu et al. 2016), and the wingless firebrat T. domestica with 32 and bristletail L. y-signata with 40 (Missbach et al. 2015).
This search revealed a single OBP, CsplOBP1, with ~50% identity for the mature protein region to TdomOBP1, ZnevOBP22, and BgerOBP38, and 30% identity to LsigOBP1, which are conserved orthologs of DmelOBP73a (Missbach et al. 2015;Niu et al. 2016). Although they cluster with Classic OBPs, these proteins have eight conserved cysteines, one more N-terminal and one more C-terminal than the conserved six cysteines, which is different from the Plus-C OBPs (see Missbach et al. 2015). All but the first exon of this 7-exon gene are supported by multiple reads from the 1KITE transcriptome for C. splendens (Misof et al. 2014), and the first exon was identified by similarity to orthologs in other damselflies identified below.
No other OBPs were identified this way, so we similarly searched the assembled transcriptomes of two coenagrionid damselflies, I. elegans (Chauhan et al. 2014) and Coenagrion puella (Johnston and Rolff 2013). Chauhan et al. (2014) found a fragment of one OBP, but do not provide its sequence. We identified highly conserved homologs of CsplOBP1 in both species, as well as three more full-length OBPs from each species that allowed us to partially model their orthologs in C. splendens (CsplOBP2-4). The first exon of OBP genes typically encodes the signal sequence and is followed by a phase-0 intron; in this large genome with large genes and long introns, it could easily be many kb upstream (for example, it is 23 kb upstream for OBP1). It is therefore difficult to discover the first exon, and we failed for CsplOBP3 and 4, although it was identified using a single RNAseq read for CsplOBP2. The final short exon could not be identified confidently for CsplOBP2, however. These are Classic OBPs with six conserved cysteines, but are highly divergent from all OBPs in the four species above. They are encoded by six-exon genes with intron phases 0-1-0-1-0 (typical for OBPs and largely shared with OBP1, although OBP3 appears to have lost the final short divergent exon). CsplOBP2/3 are in a tandem arrangement, and all three proteins share ~30% amino acid identity.
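The intron phase pattern 0-1-0-1-0 mentioned above follows directly from the coding lengths of the exons: the phase of an intron is the number of bases of the interrupted codon that lie upstream of it, that is, the cumulative coding length modulo 3. The sketch below illustrates this with hypothetical exon lengths chosen only to reproduce that pattern; they are not the real CsplOBP exon lengths, and the function name `intron_phases` is likewise illustrative.

```python
def intron_phases(coding_exon_lengths):
    """Return the phase of each intron of a gene.

    Assumption: the lengths cover only the coding part of each exon, in
    transcript order, starting at the first base of the start codon.
    """
    phases = []
    cumulative = 0
    # There is one intron after every exon except the last.
    for length in coding_exon_lengths[:-1]:
        cumulative += length
        phases.append(cumulative % 3)
    return phases


# Hypothetical coding lengths (bp) for a six-exon gene.
exon_lengths = [60, 70, 83, 64, 38, 45]
print("-".join(str(p) for p in intron_phases(exon_lengths)))
```

Comparing such phase strings between genes is a quick check on whether a new gene model shares the intron structure expected for its family.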
It is difficult to be certain we have identified all the OBPs in C. splendens because any OBP that is too divergent to find with TBLASTN in the genome sequence or expressed at too low a level to be found in the whole body transcriptomes of the two coenagrionid damselflies would not be discovered. Nevertheless the unusually small number of OBPs we found, an order of magnitude lower than many studied insects, is consistent with the small number of ORs.

Opsins
The C. splendens gene set was also searched for opsins using a set of 16 reference opsins (Table S2) (Hering and Mayer 2014). All C. splendens genes that had a match with an e-value <1e-10 were retained as candidate opsins. In addition, HMM profiles were generated for each of the nine major opsin clades: cnidopsins, vertebrate c-opsins, pteropsins, Group 4 opsins, arthropsins, melanopsins, non-arthropod r-opsins, arthropod visual opsins, and onychopsins (Table S2) (Hering et al. 2012). These profiles were also used to scan the C. splendens gene set, and genes having a match with an e-value <1e-30 were also retained as candidate opsins. The two candidate opsin sets were then merged, giving 34 candidate opsins, which were further examined for (a) the presence of the conserved retinal-binding K296 residue (Palczewski et al. 2000), and (b) a significant match against an opsin-related cluster in Uniref50. A candidate opsin that met neither criterion was discarded. This filtering step resulted in a final set of 17 genes that likely represent real opsins. As a last step before the phylogenetic analysis, all 17 opsins were manually curated using WebApollo (Lee et al. 2013). Additionally, we extracted the partial sequences of another two opsins from the genome sequence. These opsins were not present in the predicted gene set and belong to two separate groups: RGR-like opsins and arthropsins. Finally, we compared these 19 banded demoiselle opsins to the opsins that were recently identified in another three damselflies and ten dragonflies (Futahashi et al. 2015). It should be noted that six of the opsins were basal to every other opsin in the analysis, and the branches leading to them were very long. Apparently, they are distantly related to opsins, but do not represent real opsins. In agreement with this result, none of them contained the conserved K296 residue, which is present in all other damselfly opsins. As a result, we repeated the phylogenetic analysis without these six genes in order to avoid errors such as long-branch attraction. For the phylogenetic analysis, we applied the same methods as for chemoreceptors: MAFFT, trimAl, RAxML, EvolView, and Inkscape v0.91.
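The candidate selection and filtering logic described for the opsins can be sketched as follows. This is only an illustration under stated assumptions: the two e-value cut-offs and the "keep if at least one criterion holds" rule come from the text, whereas the data structures, function names, and gene identifiers are hypothetical, and the search results are assumed to have been parsed beforehand into dictionaries of best e-values.

```python
from dataclasses import dataclass

BLAST_EVALUE_CUTOFF = 1e-10  # reference-opsin similarity search (from the text)
HMM_EVALUE_CUTOFF = 1e-30    # clade-specific HMM profile scan (from the text)


@dataclass
class Evidence:
    has_k296: bool            # conserved retinal-binding lysine present
    uniref50_opsin_hit: bool  # significant match to an opsin-related cluster


def select_candidates(blast_best, hmm_best):
    """Merge the genes passing either search threshold.

    Both arguments map gene id -> best e-value (hypothetical pre-parsed input).
    """
    from_blast = {g for g, e in blast_best.items() if e < BLAST_EVALUE_CUTOFF}
    from_hmm = {g for g, e in hmm_best.items() if e < HMM_EVALUE_CUTOFF}
    return from_blast | from_hmm


def filter_candidates(candidates, evidence):
    """Keep candidates that satisfy at least one of the two criteria."""
    kept = []
    for gene in sorted(candidates):
        ev = evidence[gene]
        if ev.has_k296 or ev.uniref50_opsin_hit:
            kept.append(gene)
    return kept


# Toy input with hypothetical gene ids.
blast_best = {"g1": 1e-50, "g2": 1e-5, "g3": 1e-12}
hmm_best = {"g2": 1e-40, "g4": 1e-20}
evidence = {
    "g1": Evidence(True, True),
    "g2": Evidence(False, True),
    "g3": Evidence(False, False),
}
candidates = select_candidates(blast_best, hmm_best)
print(filter_candidates(candidates, evidence))
```

In this toy run, g4 fails both search thresholds and g3 is discarded at the filtering step, mirroring how the 34 merged candidates were reduced to 17.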

Arrestins
One of the protein families in our blastclust clusters (see "Protein families"), with similarity to arrestins, had at least twice as many proteins in the damselfly genome as in any other insect genome. To study this family further, we first obtained additional genes from the InterProScan analysis by fetching entries matching the keyword "arrestin". The results encompassed all genes found in the arrestin blastclust cluster and four extra matches, for a total of 14 damselfly arrestins. We then examined the amino acid sequences to verify that the corresponding gene models were not fragmented, by looking for the presence of complete C- or N-terminal arrestin domains (NCBI Conserved Domain search online, last accessed April 2016). Subsequently, the amino acid sequences encoded by these genes were extracted and compared with the corresponding sequences of 18 D. melanogaster arrestins. We restricted our analysis to the fruit fly arrestins because only these are well annotated. Phylogenetic analysis was performed as before (see above) using MAFFT, trimAl, RAxML, EvolView, and Inkscape.

Figure S1: Conserved genomic arrangement of the TipE gene cluster in Calopteryx splendens. The last intron of the ortholog of the Drosophila melanogaster CG18675 gene traps the Teh2, Teh3, Teh4, and TipE orthologs, and the Teh1 ortholog is located about 75 kb further downstream, matching the inferred ancestral arrangement of these genes in insects (Li et al. 2011). As in other insects, the last exons of TipE and of CG18675 have a short sequence region in common, although in different reading frames.

Figure S3: Comparison between eight insect species of InterPro entries of interest. The corresponding Pfam model is covered >75%. In the cases presented here, the three most abundant InterPro entries in the Calopteryx splendens gene set are shown, in comparison to other species (upper section). Cases in which C. splendens has a number of genes that is either above average or below average are shown in the middle and lower sections, respectively. Transposable elements were not considered. The value indicated on the right corresponds to the number of genes found in the species with the highest count, which defines 100% on the scale. The height of each bar is proportional to its total gene count, following a square root scale.

Figure S6: Genomic region encoding a cluster of six sigma GST genes in the Calopteryx splendens genome. Large boxes correspond to exons and arrows indicate gene orientation.

NESGRPVVFVSEALCKVNSKKARRELHLFSYQLLHTKIQFSACGFFPIDYSLLTSMAAAVVTHLVVLVQFQLSTKD
RSQCICPFTYGTEVPLLPPTPRS
>CsplGr45PSE
MGDRQLLWALKPMLWLNTTIGIAPPLLSSESEDKLLRRHRRRSIAVAFSMAFLTLLLTISSLIDIETFSLNSVISQMIT
VAWMLFYVTFGATSGCNYVVRLTAVTKIFKMLKNFEAILQSCPTAPLKTVRQSIECQLILVVFTTVQFFLNIWIVGYT
HGYDANFLITCVISIWAFINFMQNWQFCNKVLLLLKCFTSINIRIASLEIVDGFDGHRPHCDELGETVVSRIMYMKKL
QLNAYHVVEKMCGVYGISNFFFVALNFFNGTFELYYLIDSMVYVDRGAAWDSYFRTSTFLWVSAFFIQFYRTISAC
HRTQGETDQTGLVVFEALLKTMSNEVRTELHLLSLLHTKIRFTASGFFFLDRSLLTSISAAVLTYVVILVQFQISRKA
PCICT
>CsplGr46aJOI
MKLERSLVTLLFMFKISGMSLAYQLTPGKDSRNLERKLAVIYRATLGAAGVLLSFTSVTISCYHWQQMANEKLIPPF
WPTFEAFWMSSHFALGALTFAWFQLRCSQLSKLIQNLSRLGGEIDVRDKFLSRLKIWGLCLASGFSFFTALIVCVY
SIILSKRPWEKYNVITNTICSVVILTFQLYYLLVLYILAYKITVLNKNIRQFGCERESAAKDSTFRRDAKRVFQGSFLK
HTSHLKVISYLACFNLKLHSAFKITNEIFGLVILCQFAYSILNATFQLFDMLMNEGILHISLSSLFADSFVFFIYMFGFT
SIIALGQITEKQADRTSQLVTEAILKVKDDRLRRELRLFSHQLLFTRIRFTACGFFSLDFSLLTSMTAAVVTHLVILVQ
FQVADKQAACLC
>CsplGr46bPJ
MDVTRCEQIPPRSQSKRVDVKIYRPIAPLLLMFQACGMSLKDLLTSRRDETGFRRKLNSALVVFVAAVGVLFSSAS
MAVSFHEFRKVAAEGKIPVFWAVFEFIWLLIHHVLGZAVYAWFQLRRCQIARLMRILARTCCEIGLKEKFTAKLTILG
LCQLAYFSLFLICSVFAQYLTIMKLPRGATNAVAITLWTVVNQIIQLYLILVLCSFLCNIIILNNGIEKLGCNRGSIEMES
YVSHDANRKILFRGDGLENMGHFKEIMYWLSFHLELHSLFRSTNSIFGLVTLFQLSYSLLHTTFQLFDLINFILTDEP
ISTYFVSIYIILSFSLGSIAIIALCQDIKYKADRTSQLVTEAILKVKDDRLRRELRLFSHQLLFTRIRFTACGFFSLDFSLL
TSMTAAVVTHLVILVQFQVADKQAACLC
>CsplGr46cJOI
MVVISFPCKRAQRKFSLKARSSSNIFFSQNVIFPPRDGTMAGESFSLEKMDVFWVLSPIMRISKALGLAPFRLSVS
RKMRKSESKADCSVYYSSLKFSLFTIITFSMIISDFVMIEKYPGSWLVNKYLNIVWDSVDNLLSAGSVLVLLLRREK
SRDLLVRIAEYDRETDGSDYFLEERKVVEGQIAYVIVAFTIFGAYVYYGLHVLEVYGYLESVKVLIIVQWICSDLMM
HLQLYNVLLLLRRRLFRLNSRLKSLQYVRAGDWPEIVGGHPEISVGIIRRCRSRFYQIFQMCNRANHIYGITVSFSI
FYNFLDSTIILYFLLTMEVQDKVDIDNLHDLLYNGMLVVILSITLVLIISVCESILKEADRTSQLVTEAILKVKDDRLRRE
LRLFSHQLLFTRIRFTACGFFSLDFSLLTSMTAAVVTHLVILVQFQVADKQAACLC
>CsplGr47aJOI
MGKDVYYAISSLLLVSRIFGVAPFHTTYQRGRGKGSLPKWPLIASVSVITVTSLITVSALPSMESRRMKRFSNQFLT
VVSTTSTVLSGSAGIVSVYLFLLRSATAGKVLRGLRKYDDAYRVKGNADFRELRRKVRTTMFFVYLVYSLLSLNLIF
TFLKFSGHFNLFAQCILTYWSFIVVLLKTQFLSMLISLRQRISRLNEDVRLRLTSLPLPEYQHQWRGPRNTGGLPG
KARWWTMMQIRIFYLSKMTSEVYGPSWVPLMIYHFLNTTSVSYYSILYLFNESDFPNATGNPLAPGVWVCHQLA
GVFIVVVTCAGVSNEADQTGVLVSEALLKVKNHKARRELHLFSHQLLHTKIRFTACGFFPLDYSLLTSMTAAVVTH
LVILVQFQLSGKEKPTCHCPYFNTSEIPPPSTTAMY
>CsplGr47bJOI
MVLSEERERTKEDLILAMGPFLRFNRALGILFLDPENKRGENNFHKSQELSLLIHSLLMMGNFLLPVYMISSNDQF
NNTVGLNRAVNVFWIMLENYLNAISAFTLIPRLGKVRRIWMELFSFEEKVMADVKETQWSRASIRLRSFFAAYVYT
HFGVNWIVRVYVNGPLSLELLAFSHVIFWTTNLVMQGWQCYNAVALLHRCLFGLNENLRQQHQFRKGSGNSEIL
RELRNRVLVKKIRHLKALQVRVFHIFRSVSSVYAIPCLTLAALTLVNLTFTAYYALELAGITDVPGMPPFHFTCALAW
GVSYTTDLVLIASAGDGINKEADQTGVLVSEALLKVKNHKARRELHLFSHQLLHTKIRFTACGFFPLDYSLLTSMTA
AVVTHLVILVQFQLSGKEKPTCHCPYFNTSEIPPPSTTAMY
>CsplGr47cJOI
MTEGENFPEGVLWALKPVLAFNWALGISPFYSDRTLDPKDSRSHYLVAIAMYSILVMLEFFMLPYGIFLFSKFDPN
ALVSDAVMTMWQITMIILNTSAGYTFVLRYSASRETIIKLLDFEETLVKPSLISLRKTRLWICSMAFYVAFAIVQFGIH
MRETVGPYGIFSIPSFLNFRMMFWSFSNLMINLQCFCLVLLLSWCILNTNASIRQTWEYFSSRNPGAFSDQEKRF
LFGRIKHLRKLQMQTYRIFRSLTNVYGLSTFAIGVSVLFCVTFMLYCILDVFLSKEPVLSVAMLIISFLWVVAIVQIFILI
LSACEKTQLQADQTGVLVSEALLKVKNHKARRELHLFSHQLLHTKIRFTACGFFPLDYSLLTSMTAAVVTHLVILVQ
FQLSGKEKPTCHCPYFNTSEIPPPSTTAMY
>CsplGr47dFJ
MKNTLRTANVIWAFNHLLRVNQFIGVAPFCLDKGGSRLAGKSRHRLEYGVIVYAVVSFLSFSFSTHVTVRLDHYLT
NSATSQTVLILWTILGELIGAISGFTFSLKLHVVQRVFRHFARYEKLSFEQSGIKLQRVRRSVKGQLIYVACALAILSS
NIGHTLSKRGFGVPFCNYLLVAFWAFVNMIQGLQFYSMMILLQSNFVDINDKLSSNIPPSSAVPNEAFNSSRFHAE
EGSAHRFKICRKLRMRCYHTHRDLNYLYGVSSLAMGGLNLVNVTFNLYCLTDLFINGERATMMPLQLTLSSAWVM
VALAPLVLVIFVCEATLMEADQTGVLVSEALLKVKNHKARRELHLFSHQLLHTKIRFTACGFFPLDYSLLTSMTAAVV
THLVILVQFQLSGKEKPTCHCPYFNTSEIPPPSTTAMY
>CsplGr47e
MYRLGLINTHWDVYLILRYLIGLTRILGVIPCLKQRGGRKWAFYNSWHRLIQNVLSFLTILSLTRTALGKSAISSTTAL
SNSLMNFWCAIETFVVLLAAQHFSLSRQAVQCIIKDLSTCNKLLAPSASQVASSKLTCSVRIQILFLTMAIAQSLSSFI
TGYKSGWKRIDFISLFLMNIWFAVSACQWSHLYATVFLLHLYLCTLNGFLRNLRSSFPGEKGSDATTSCTIDLRKN
FVAGFTRRERAAQLRAFRVFKNTSKAHSIAIIALGTQNLINITLICYFAINGMMNKEEVIMSGGDMFSTGFWTLNNF
TQMFLVVKVCNRTQNEADQTGVLVSEALLKVKNHKARRELHLFSHQLLHTKIRFTACGFFPLDYSLLTSMTAAVVT
HLVILVQFQLSGKEKPTCHCPYFNTSEIPPPSTTAMY
>CsplGr47f
MLMLSNLIGLAPSLIHLNGGNQSFRHYLRKIIISNLILATLSGVLLIQTLMNNWLFQLNSQVSQVVMITWLSSYISLG
AISGCVFTTKLANVKDIFIRLNEFEQELSRCSMVTLREIRKSILIQRALVYFLCLELLANIWSMGNSYGFSTYVFTMS
AVVIMWSFVNVIQQWQFCINAYLILRCFSNLNNEIYSWGKSNVIATSSYGSGRTSSECDSEAGKLRILRKLQLNAY
KTFKKICIVYGTSTFLCATLNFFNATFILYYLLDLISYQSHAAWSASFLLTSSTWVTTYFILFYCVLSTCVRAEKMAD
QTGVLVSEALLKVKNHKARRELHLFSHQLLHTKIRFTACGFFPLDYSLLTSMTAAVVTHLVILVQFQLSGKEKPTCH
CPYFNTSEIPPPSTTAMY
>CsplGr47g
MAEKKFSREEEVHHWLPKPLFLLHRAFGIAPWTLHSSTQRDPIQICVKTIYTAAFLCSLALNSCAWVIEIKYVFHTF
VTSLVIGLRCYDAFFGSFSVYIFALRYYAIKNIFFGLISIINIERNSNSRKTSLLSRILKFIYVLIVLFFICITFEGIFKSKIP
HLSKLTTLAIKTSLIWRFLNVLQIWQFISKVDTVRNVINYWNSKIKNLKNGAKNNQTSGAEFNQDCREPSAKEIRR
MRRVHLKIFIVTKEIQSVYGISAFTFIVRNFVTITFSIYMPLIIYLKLNNSVQNNEVTHRFWILWAANFLASSLLVIYCE
EFFNEADQTGVLVSEALLKVKNHKARRELHLFSHQLLHTKIRFTACGFFPLDYSLLTSMTAAVVTHLVILVQFQLSG
KEKPTCHCPYFNTSEIPPPSTTAMY
>CsplGr48
MARTQYSREKELDHRFSEPGLIYRFFGIAPWAIHSNGQSDASLKYVKAFYTVALIFSLALAACESYAKSRKKSLFH
SLTVSIVYYDIVVGCINGYVFVLRHNAVKNIFVDLMSIGKILNIHRDKRNNSLRSYLAGLFRLLILLFFILLTLERLFIFTL
PHLNSTTTISIKLAFIWNTVTFLQEWQLLNKMTVIRNFINNLNLRTRNLANGSYDMRASGVPLHREHGIGRLMMEI
RRLRQINLKIFAVIKDIQCVYGVPTLVFITRSLMGITLQYYLLVSDYLEKIMGAKNNDRTRYFWISGTIQLLATWIIAAS
CESTLSKADQTGLLVSEALLKVESHKARRELHLFSHQLFHTKVNLTACGFFPLGHSLLKSMAASVLIYFVILIQFQL
SVK
>CsplGr49aFJ
MRGGSRSRNNPREDMLWSLQPILIHGRIIGLPPYSSNRDEKKESHRYWPYLCHACVILTTWTVVTTVCISMIVSGY
DHFTRDSSLSLYLMLTWTLLENGLRIAAVCSLVHKRHACQNFFANLIKYDDSLQDTKTLRHRCTRKTVVNIMVFFV
CLWSFPLLLAVVRSNDAINKLQAVLSSFNSAMILVSGLQFKAVLIVLHLRVSSLNKEIQNLYGNERSVVKIPRSLVKL
RTRRCIKGPIREIEARQLYLFRLCKMLNNIYQVSNLFHNIANLITFVFTLYYILVYLVIESFPQLVTEVTLYTTWKVTVT
LLSTAMSVNSCEEISMTADQTGVLVSEALLKVKDHQARRELRLFSHQLLHTKVRFTACGFFSLDFSLLTSMTAAVV
THLVILVQFQLAGRDTPTTCNCTQENSSMTMTGLVTTPLP
>CsplGr49bFPJ
MRGGSSSRKNPRENMLWZLQPILIHGRIIGLPPYSCINGEMKESHRELPYLFHASRIZTTWAVGTIVSSANTFSAN
DHYTRDSIIGLYVLLSWVLLVNGLAIVAVFSLVRLCHVYETFFADLIKYYDCLHDTKTLRNSYTRMTVVNXLIFVVCLL
CFPLFLEIVRSYDSISMLRAVLINSAMVLVSGLQLKAVLTVLQLPVSSLNLEIQNLYGYERSVVELPRSLVKLRTRRC
IIGAIRELDDRQLNLFRLCRRLNSIYQISNLFLNIANPMPFVFTLYYILVYPVIESYPRZINEIALPSTWQVAITFLSTAM
SINSCEEXMTADQTGVLVSEALLKVKDHQARRELRLFSHQLLHTKVRFTACGFFSLDFSLLTSMTAAVVTHLVILVQ
FQLAGRDTPTTCNCTQENSSMTMTGLVTTPLP
>CsplGr49cFJ
MRGRSSSRKNLREGMLWSLQPILTLGRIIGLPTYSCISGEMKESLRDWPYLFHAYGILTTWTVVTIVSIANFLSGND
HFIRGISLSQFVMLFWELLGNGLPIAAVFSLVHQRHACENFFADLIKYDDCLYYTKSLRHRCTRKTVVNIMVFVVCL
WPFPLFLAVVRSYNAQGMLHAALSSCRSAMALVSGLQFKAVLIVLHQRVSSLNQEIQNLYGYERTVVELQRPLIKL
KSRRCILGTIREIKARQLHLFSLCRKLNNIYQVSNLFFNMANLMTFVFTLYSILVYLVIESFPQYVIEITVYYTWQVTIT
LLSTAMSVNSCEETSTSADQTGVLVSEALLKVKDHQARRELRLFSHQLLHTKVRFTACGFFSLDFSLLTSMTAAVV
THLVILVQFQLAGRDTPTTCNCTQENSSMTMTGLVTTPLP
>CsplGr49dFJ
MRGRSSSRNNPREDLLWSLQPILIHGKIIGLPPYSNINGEMKKSHRDWPYLCHACGILITWTLVAIGSIANILLGNNH
FTRDSSLSLYVMVSWVLLENGLAIVAVCSLLHQRHACEMFFADLIKYDDCLQDNKTLRHSCTRKTVVNITAFIMCL
WCCPLILAVVLSYDALDMLKTALSVFNSAMVLVSGIQFKAILIVLHLRVSSLNQEIHNLYGYERTVVEHPRSLVKMR
SRICILDTIRELEARQLNLFRLCRELNNIYQVSNLFFNIANLMTLVFALYYLLVYLVIEPFPQYLTEVALYCTWQVTIILL
STAMSVYSCEETSMTADQTGVLVSEALLKVKDHQARRELRLFSHQLLHTKVRFTACGFFSLDFSLLTSMTAAVVT
HLVILVQFQLAGRDTPTTCNCTQENSSMTMTGLVTTPLP
>CsplGr49eFJ
MREENSSKNNPREDMLWSLKPILIHGRIIGLPPYSCFKDDNKESIRDWPYLCHASGILTTWTVGTIVSSANIFSGND
QYTRDSIISLYVLWSWVLLGNSLAILAVCSLFHQRHAYENFFADLIKYHDFLHDTKTVRHRCTRKIVVNITAFIVCLW
PLPLFLAVVHSAEALGILNAALASFNSAMVLVSGFQFKAVLIVLHLQVSSLNQEIQNLYGYERSVVELPEPLVKMRT
RRCIIDTIRELEARQLNLFRLCRKLNNIYQISNLFLSIGNLMTFVFALYYILVYLVIESFPQCVTEIALYSTWQVTVALLS
TAMSVNSCEETSMTADQTGVLVSEALLKVKDHQARRELRLFSHQLLHTKVRFTACGFFSLDFSLLTSMTAAVVTH
LVILVQFQLAGRDTPTTCNCTQENSSMTMTGLVTTPLP
>CsplGr49fFJ
MGGGSSSRNNPRENMLWSLQPILIHGRIIGLPLYSCIRDEKKESHRDWSYLCHASGILTAWTVVTIVSIALILSGYD
HFTRDSIISQIVAFTWALLGNGLRIAAVDSLVHQRHACQNFIADLIKYDDCLQDTKTLRHRCTRKNVVNIMVFVVCL
WCFPLFLEIVRSYDTLGMLKAVLASFNSVMVLVSGLQFKGILIVLHLRVSSLNQEIQNLYCYERSVVEPPRSLVKLR
TRRCIIDTIREIEACQLYLFRLCKMLNNIYQISNFISQYCKSDYFCFYALLYAGLPSDRIVSAIRNRKRTSLCLASHHHI
AKYGHECADQTGVLVSEALLKVKDHQARRELRLFSHQLLHTKVRFTACGFFSLDFSLLTSMTAAVVTHLVILVQFQ
LAGRDTPTTCNCTQENSSMTMTGLVTTPLP
>CsplGr49gPJ
IRGRRSSRNNSREDMLWSMKPILILZRIIGLPPYSYINGEMKKVHRDWPYLCLSYZILTTSKVGTIMSTANILSGND
QYTRDYIIYLYVLLSWFLLGNSFDIAAVCPLVHLRHACDNFFADMIKYDDCLHDTKTLGHRCTSKTVVNIMVYIVFM
WSCPLFLTIVRSYDAVGKLNSALSLCRSAMVLVSGLQFKAVLIVLQLRVSSLNQEIHNLYGYERSVVELPRSLVKLR
TRRCLIGTNGEIEHRQLYTVTRFCLCRKLNNIYFKFGRADQTGVLVSEALLKVKDHQARRELRLFSHQLLHTKVRF
TACGFFSLDFSLLTSMTAAVVTHLVILVQFQLAGRDTPTTCNCTQENSSMTMTGLVTTPLP
>CsplGr49hPJ
REGCNVQHEFRKGVSWTLHSIPTLGRLMGLPPYCFIGNERKKPIRDGRLTFQAVGILATCIVGTIYSLETKZRGTTA
QFTYTSLYVTFSWSLMGNVLVFMAICSLVHRRHSCEKFLRDLIKYEESQRDTSTLKHSHTRKTVVSMMILCLGAW
CFPLAIVTCFDDIVSLTALSMIEVAMYLCSNAMVLVPGLQFKAFLIVLPLRVSSLNQGTRILVGXQTSILEHPMPSSK
LKIGRYMRDAIREARQLNLHILGRTLNDIYQVANLVHADQTGVLVSEALLKVKDHQARRELRLFSHQLLHTKVRFT
ACGFFSLDFSLLTSMTAAVVTHLVILVQFQLAGRDTPTTCNCTQENSSMTMTGLVTTPLP
>CsplGr49iFP
MSLVFRAPKVETAPIVCGGSSGKNKHLKDQFWSMHPMVAFGRIIGLPTNPFDGDGSREPRRDWLFPYHAAGLLT
KWGVGTIVSLASIASGEDPFTQDSPTSLIVMLLLTLMGNGIDMGAVVSLVHHRRTSEKFFGDLVKYDDGLRDTNTF
KHSHDWKDVVTMMALIVSSWCFLLALSXFGVSHSIHAZDMLKAELYLCSSAMVVVSGLQFKAFLIVLRLRFSRLH
RKYVPWRETKAQRRHIXCRMHERMTRDIEERQLNVHRLCRALYDIYQVPNLFYSIVNLTNSDFAFYSZFVFLVGD
FLFWYGDVISLYPNGQATLSLFNVPLDVSGCKEISDEADQTGVLVSEALLKVKDHQARRELRLFSHQLLHTKVRFT
ACGFFSLDFSLLTSMTAAVVTHLVILVQFQLAGRDTPTTCNCTQENSSMTMTGLVTTPLP
>CsplGr49jFP
MRGMSSSRNNPRKDMLWSLQPILIHGRMIGLPPYSSNSGEVKEAHRDWPYLCLAYGILITWTVGTIVSTVNILSRN
NQYTRDSSLSLYVMLSWVLLGNGLAIVAVCSMVRQRHDCENFFADLIKCDNCLQDTKTLRHSCTRKTVVNMMVFI
VFLWSFPLFLAVVRSYDALGMLNAALSLCRSAMDLVSGLQFKAVLTVLHLRVSNLNQEIQNLYRYESSVVELPRSL
VKLRTRRCIIDPIRDIEARQLNLYHLCRKLNNIYQFPNLFFSIGNLMNLVFTLZNIVYFLSDSFLRIISPFTLYNVWNIT
VIVLITTLDLNSCEETSIVADQTGVLVSEALLKVKDHQARRELRLFSHQLLHTKVRFTACGFFSLDFSLLTSMTAAVV
THLVILVQFQLAGRDTPTTCNCTQENSSMTMTGLVTTPLP
>CsplGr49kFIX
MRRRSSSRKNPIEDLLWSLQPILILARIIGLPPYSSIRGEMKEAHRDWPYLCIAYGILTTWTVGTIVSIASILSGNAHYI
YYSSTSLCVIFFSKLLGNGLVMLAVFSLVRQRHACESFFADLIKYDDCLHYTKTLRHRCTRKTVLNITAFILGMWSF
PLFLAVVHSTDVQEIVRAALFLCRSAMVLVFGLQFKAVLIVLHLRVSSLNQEIRNLYGYESSVVELPRSLVKLRTRR
CIMDTIREIKARQLNLYHLCRKLNNIYQVPNLLFSIANLMTLVFTFYSLIVYFLSDSFLRIISPFTLYNVWNITVIVLITTL
DLNSCEETSIVADQTGVLVSEALLKVKDHQARRELRLFSHQLLHTKVRFTACGFFSLDFSLLTSMTAAVVTHLVILV
QFQLAGRDTPTTCNCTQENSSMTMTGLVTTPLP
>CsplGr49lFP
MLLSPNNFSPNTLSTHYHIPLKCEGSNVQHDPRKGTWTLHSIPTFGKLMGLPPYCFIGDERKKPIRDGRFIFQAVE
MLATWIVLHDIFPRRNSNTVFAKVTYTSRYVIFSWSLLGNGLIIMAVSSLVHRQHSVEKFLRDLIKYDDNZQDTSTL
KHSHTRKTFVSMMILDLGAWCFPLAIATCFHVLVSLTALSLIILAISLCSNAMVLVHGLQFKAFLIVLHLRFSSLNYGI
RTFVGFETSVLEHPIPSSKSRIGRCMRDAIRDMEARQLNLHILGRTLNDLYQVANLCZNILHLVNIMFEFYFLYVCIG
GEPPMMHGYTFSVYSSWHIFPAILLMVTDMSSCVDISKXADQTGVLVSEALLKVKDHQARRELRLFSHQLLHTKV
RFTACGFFSLDFSLLTSMTAAVVTHLVILVQFQLAGRDTPTTCNCTQENSSMTMTGLVTTPLP
>CsplGr49m
MELSHHRSEKNTASETSVTFNRQCTDNKSAKDVFWSLGPLLKLGKIMGLPPYTSERDGKSQPHSEGYFMYQTLI
VLMAWIAGVLVSIANMISGNDEFTRDSSTTLYIMVFSMLVVNILVMTAVCSMVLHGEACEDFLRDLFKYDGGLRDIK
TLKYSSTRRTIVSMMASFAFGWCFPLVLTAYFILVYSVDRGIILKGALTSGRSVMIIVPGLQFRALLIVLHQRMSCLN
QEIQSLVNFNSERPTRSNQVRNGRCLRDTICDAKARQLNLYRLCKVLNDIYQVPNLFYNIVNLTQTIFVLYFLFIYIK
GDLSVGFPNAATFYMTCVFMLNLAVTVLDMSSCAALSKEADQTGVLVSEALLKVKDHQARRELRLFSHQLLHTK
VRFTACGFFSLDFSLLTSMTAAVVTHLVILVQFQLAGRDTPTTCNCTQENSSMTMTGLVTTPLP