Biocuration as an undergraduate training experience: Improving the annotation of the insect vector of Citrus greening disease

Shippy⍺,β,, Andrew Rosendale⍺,β,4, Chris Cordola, Tracey Bell, Hannah Mann, Gabe DeAvila, Daniel DeAvila, Zachary Moore, Kyle Buller, Kathryn Ciolkevich, Samantha Nandyal, Robert Mahoney, Joshua Van Voorhis, Megan Dunlevy, David Farrow, David Hunter, Taylar Morgan, Kayla Shore, Victoria Guzman, Allison Izsak, Danielle E Dixon, Liliana Cano , Andrew Cridge , Shannon Johnson , Brandi L Cantarel , Stephen Richardson, Adam English, Nan Leng, Xiaolong Cao, Haobo Jiang, Chris Childers⍺,, Mei-Ju Chen⍺,17, Mirella Flores⍺,Ƴ,, Wayne Hunter, Michelle Cilia,


INTRODUCTION
The Asian citrus psyllid (ACP), Diaphorina citri Kuwayama (Hemiptera:Liviidae), is a phloemfeeding insect native to Southeastern and Southwestern Asia with a host range limited to plants in the citrus genus and related Rutaceae spp. (1). Accidental anthropogenic introductions of psyllid-infested citrus combined with the ability of psyllids to disperse rapidly has allowed D. citri to extend its distribution to most of southern and eastern Asia, the Arabian Peninsula, the Caribbean, and South, Central and North America (1)(2)(3)(4)(5)(6). For years, ACP has been classified as a global pest that is capable of devastating citrus crops through transmission of the bacterial agent, Candidatus Liberibacter asiaticus, CLas, which is associated with Huanglongbing (HLB) or citrus greening disease. The psyllid alone has little economic importance and causes only minor plant damage while feeding (7,8).
HLB is the most destructive and economically important disease of citrus, with practically all commercial citrus species and cultivars susceptible to CLas infection (9). Infected trees yield premature, bitter and misshapen fruit that is unmarketable. In addition, tree death follows 5-10 years after initial infection (2,9,10). Furthermore, HLB drastically suppresses economic progress in southern and eastern Asia by impeding viable commercial citrus agriculture within those regions (11). Florida is one of the top citrus-producing regions in the world and the largest in the United States, with nearly double the output of California, the second largest citrus-producing state (12). HLB puts the 9 billion dollar Florida citrus industry, with an annual net value of 1.5 billion dollars, at tremendous risk [USDA 2009]. In 2008, the HLB infection rate within central Florida was low (1.4% to 3.6%), but reaching 100% in the southern and eastern portions of the state (13,14). In 2005, when HLB was first detected in Florida, 9.3 million tons of oranges were harvested, but production has declined steadily to 5.3 million tons in 2016 as ACP and HLB have spread (12).
Primary management strategies focus on disrupting the HLB transmission pathway by suppressing psyllid populations and impeding interactions between CLas and psyllids. These strategies currently rely on extensive chemical application, which has broad environmental impact and high costs, and are ultimately unsustainable. To develop molecular methods that exploit current gene-targeting technologies, detailed genetic and genomic knowledge, including a high quality official gene set (OGS), is required (15,16). Early efforts focused on D. citri transcript expression (3,(17)(18)(19), analysis of the full transcriptome (20,21) and, more recently, analysis of the D. citri proteome (15). Arp et al. (22) performed a BLAST-based inventory of NCBI-predicted immune genes in D. citri (v100, see Methods). In contrast, we have conducted broad structural and functional annotation with the aid of a comprehensive transcriptome and created an official gene set for D. citri with a focus upon completing the repertoire of immune genes.
Manual curation improves the quality of gene annotation and establishing a 'version controlled' official gene set (OGS) provides a set of high quality, well-documented genes for the entire research community. Although ACP is a significant agricultural pest, it is not a model organism and the size of the research community does not warrant "museum" or "jamboree" annotation strategies (23). To maximize the number of genes annotated in a relatively short time, we augmented the "cottage industry" strategy (24) by training undergraduates to perform basic annotation tasks. The dispersed ACP annotation community agreed on a set of standard operating procedures and defined a set of primary gene targets. Starting with automated gene predictions, we used several additional types of evidence including RNAseq and proteomics data including comparisons to other insects to generate the first official gene set (OGS v1.0). Using an independent transcriptome enabled us to identify a number of gene families that are either missing or misassembled in the draft genome.
Our manual curation efforts focused on genes of potential use in vector control including immunity-related genes and pathways, RNAi machinery genes, multiple clans of cytochrome P450 genes and other genes relevant to insect development and physiology. We speculate that targeted analysis of these genes in D. citri will provide the foundation for a better understanding of the interactions between psyllid host and CLas pathogen, and will open the possibilities for research that can eventually find solutions to manage the dispersion of this very destructive pest and HLB.

NCBI-Diaci 1.1 draft genome assembly
The Diaci1.1 draft genome assembly was generated using Illumina paired-end and mate-pair data with low coverage Pacbio for scaffolding and uploaded to NCBI (PRJNA251515) after filtering out bacterial contamination. This genome contains 161,988 scaffolds with an N50 of 109.8kb. Given the high degree of fragmentation, we performed a Benchmarking sets of Universal Single-Copy Orthologs (BUSCO) version 1 (25) analysis using a set of conserved single-copy markers, which showed that a significant number of these genes were missing (26%) or fragmented (22%) as described in the curation section below.

MCOT transcriptome
To generate a more comprehensive set of gene models, we used the MCOT pipeline (26) to produce a D. citri MCOT transcriptome assembly that combines transcripts from genome-based Maker (27,28) and Cufflinks (29) pipelines with RNAseq-based (de novo) Trinity (30) and Oases (31) assemblers to select the best models. D.citri MCOT v1.0 contains 30,562 CDS, transcripts . CC-BY 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted January 9, 2017. . https://doi.org/10.1101/099168 doi: bioRxiv preprint and proteins, some of which are based solely on transcript evidence from RNAseq (ftp://ftp.citrusgreening.org/genomes/Diaphorina_citri/genome/diaci1.1/). The completeness of D. citri MCOT v1.0 was assessed based on comparison to the BUSCO v2 beta (25). BUSCO v2 is based on a set of 1,066 single-copy orthologs from 133 arthropods species ORTHODB v9 (32). D. citri MCOT v1.0 contains 1039/1066 (97.5%) complete BUSCO orthologs, 808 (75.8%) of which are single copy and 231 (21.7%) of which appear to be duplicated. Four additional BUSCO orthologs (0.4%) are fragmented, and only 23 (2.2%) were not present. BUSCO analysis was also performed on the other resources used for annotation including three stage-specific de novo assembled transcriptomes from egg, nymph and adult tissue (20), the NCBI-Diaci1.1 genome assembly itself, the Maker v1.1 and NCBI v100 predicted gene models, and the D.citri MCOT v1.0 gene models that could be mapped to the genome assembly ( Figure 1 and Supplementary Table 1). The Maker v1.1 annotation set contains 18,205 protein-coding gene models. NCBI D. citri Annotation Release 100 (NCBI v100, https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Diaphorina_citri/100/) contains a total of 19,311 protein-coding gene models along with annotations of non-coding RNAs (776) and pseudogenes (207). As shown in Figure 1, none of these data sets proved to be as complete as D.citri MCOT v1.0, which contained a higher percentage of complete BUSCO orthologs and fewer fragmented or missing orthologs.

Manual curation workflow
Manual curation of the D. citri (NCBI-Diaci1.1) draft genome assembly was undertaken to improve the quality of automated annotations (Supplementary Table 2) produced by the Maker (Maker v1.1) and NCBI pipelines (NCBI v100). This community-based manual annotation was focused on immunity-related genes as targets for ACP control.
The Apollo Genome Annotation Editor hosted at i5k Workspace@NAL (https://i5k.nal.usda.gov/) was implemented for community-based curation of gene models (34). Multiple evidence tracks (Supplementary Table 3) were added to Apollo to assist in manual curation. Standard operating procedures were outlined at the beginning and refined based on feedback from annotators and availability of evidence resources (see Methods). A typical workflow for manual gene curation involved selection of orthologs from related species, search of the ACP genome and assessment, followed by correction, of the gene models based on evidence tracks to generate the final model. Any exceptions to this general workflow are specified in the respective gene reports (Supplementary Notes 1-39). We used ImmunoDB (35) as a primary source of orthologs for curation of immune genes (Supplementary Notes 1-31). However, we also report other gene families of functional and evolutionary importance (Supplementary Notes 32-39) including aquaporins, cuticle proteins and secretory proteins. Following correction, final gene models were verified using reciprocal BLAST analysis. With this community-based curation effort, we annotated a total of 530 genes, the majority of which include genes predicted to function in immunity, development and physiology.

Annotation Edit Distance (AED) as a measure of quality for different annotations
We employed the Annotation Edit Distance (AED) (28,36) metric to evaluate the quality of the different annotation pipelines based on evidence from expression data. The AED had a two-fold application here, selection of the best predicted gene set and quantification of improvement after manual curation. To generate AED scores, we compared the annotations to known insect proteins (NCBI taxonomy:"Hexapoda [6960]") and to a comprehensive genome-guided transcriptome assembled from the latest data (See Methods and Supplementary Table 4) and the NCBI-Diaci1.1 draft assembly using the Maker pipeline (27). Most of the RNAseq data used to create this transcriptome was not used in the prediction of genes by other pipelines (NCBI v100, Maker v1.1 and MCOT v1.0) since it had not yet been produced and therefore provides independent validation. RNAseq data used for building the transcriptome included adult, nymph, egg, CLas exposed and healthy (adult and nymph tissue) as well as gut-specific expression data (see Supplementary Table 4). We calculated AED scores for NCBI v100, Maker v1.1 and mapped MCOT v1.0 annotation sets.
The plot for the AED cumulative fraction of transcripts ( Figure 2) shows significant expression evidence support for NCBI v100 genes compared to the Maker v1.1 and mapped MCOT v1.0 annotation sets. Therefore the NCBI v100 annotations were selected to create the official gene set v1 (OGS v1.0). The AED plot also shows that there are many gene models in all the annotation sets that do not have any expression evidence support (AED = 1.0). This could be due either to lack of RNAseq data for some of the loci or incorrect annotations. The predicted annotations may have been affected by misassemblies in the NCBI-Diaci1.1 draft genome.
To quantify improvements in the gene structure by manual curation, we calculated AED scores for only the curated genes (530 genes). Cumulative fractions of transcripts of AED for curated genes show improvements indicating that the intron-exon structures in the curated genes have been corrected by manual curation (Figure 2).

Training and curation strategy
Curation Strategy: Harnessing the Crowd A single individual or computer system is currently unable to fully curate a genome with precise biological fidelity. A growing number of genome sequencing projects are the combined efforts of global consortia, (for example parasitoid wasps (37), centipede (38) and bed bug (39)). This indicates that a centralized model of genome annotation, the design used in earlier sequencing projects for the fruit fly (40) and the human genome (41), is giving way to a global and 7 collaborative communities to generate high quality genome annotation. Beyond the problem of scale, curators require insights from others with expertise in specific gene families, which makes the process of curation inherently collaborative. Mobilizing groups of researchers to focus on these specific and manageable areas is more likely to distill the most pertinent and valuable knowledge from genome analysis.
We formed a team of 36 curators by enlisting collaborators primarily distributed across seven academic institutions (Indian River State College, Cornell University / Boyce Thompson Institute, Kansas State University, University of Cincinnati, University of Florida, University of Otago and University of Tennessee Health Center), including undergraduate and graduate students, postdocs, staff researchers, and faculty. Training was provided at in-person workshop sessions and online tutorials. Training sessions were structured to provide a review of principles associated with identification of specific genes by comparison to orthologs, gene structural aspects, how to search the genome for targets of interest, and using the combination of gene predictions and additional evidence tracks (Supplementary Table 3) to correct gene models with the Apollo platform. Lastly, secondary analyses, such as BLAST comparison to other insects and phylogenetic analyses, were used to examine specific genes or gene sets.
After initial in-person training, we continued to closely follow the team's progress via bi-weekly video conferences, collaborative documents on Google Drive and an online project management website (basecamp.com). The online forum facilitated discussions outside of the video conferences and allowed annotators to interact at their convenience. We established standard operating procedures for minimum evidence required for annotating a gene model, gene naming conventions and quality control checks (See Methods and Supplementary Table  5). This was valuable in ensuring that annotators followed consistent guidelines throughout annotation. Curation of a gene family by an annotator was followed by a presentation to summarize their results during the video conference and peer-review.
The undergraduate annotators at each site were mentored by a local faculty who coordinated separate in-person meetings. A few experienced annotators from the i5k community also participated in this curation effort. The video conferences facilitated a healthy discussion about curation methods among different groups of annotators. The combination of video conferences and online forum provided curators with the tools required to efficiently share data and information as they worked in teams across institutional boundaries. Moreover, expert community curators volunteered their time to multiple projects and this model allowed them to contribute according to their availability.
Manual curation was incorporated into a bioinformatics class in 2016 by one of the authors (Benoit) at the University of Cincinnati. Specifically, one week in the class was utilized for genome annotation through the D. citri Apollo instance for twenty-two students. This focused on how RNAseq datasets contribute to gene prediction and why these models need to be corrected manually. Each student was responsible for correcting two gene models. As part of this course, students were given an assessment test before and after the class. Three questions (Supplementary Table 6) of this test focused on gene prediction and genome annotation, which were the major educational focus of the genome annotation week. The average score on these questions before the class was only ~38%, which improved to ~90% at the completion of this course.
Two of the participating institutions (Benoit, University of Cincinnati; D'Elia, Indian River State College) integrated the D. citri genome annotation into senior capstone courses. Students that participated in capstone projects at UC covered eight topic areas; aquaporins, acidic amino acid transporters, glycosphingolipid metabolism, glycolysis, histone binding, vitamin metabolism, vitamin transport, and Hox genes. With the addition of the capstone course, a total of 28 UC students participated in the manual curation process. Six students participated in capstone projects at IRSC during which they annotated 15 gene families. In total, 25 students directly participated in the manual curation process, contributing to 39 gene reports. This strategy reinforces the use of undergraduate students in gene annotation, as students show an increase in learning and produce scientific reports which contribute to peer-reviewed publications.

Immune pathway in D. citri
Identification of the pathogen-induced immune components in ACP is critical for understanding and influencing the interaction between D. citri and CLas. The repertoire of immune genes is known to be a very diverse functional group and includes proteins that recognize infectious agents and initiate a signal, members of signal transduction pathways that relay the message to the nucleus, and the genes that are transcribed in response to infection. Below, we have briefly summarized our findings from manual structural and functional curation of immunity-related genes. Additional details are provided in gene reports for specific gene families and in the supplementary notes 1-31.

Pathogen recognition molecules
Recognition of infectious agents, the first step in immune defense, relies on a variety of cellular receptors including C-type lectins, Galectins, fibrinogen related proteins (FREP), Peptidoglycan recognition proteins (PGRP), Beta-1, 3-Glucan recognition proteins (βGRP) and thioestercontaining proteins (TEP). These proteins recognize pathogen associated molecular patterns (PAMPs) which are associated with microbial cells (42). Most of these recognition molecules are widely conserved in insects, although the copy number often varies. C-type Lectins Ten C-type lectins (CTLs) carbohydrate binding receptors (43,44), were identified in D. citri (Supplementary Note 1). These include three oxidized low density lipoprotein receptor genes and genes encoding C-type Lectin 3, C-type Lectin 5, C-type Lectin 8, E selectin, Perlucin, Agglucetin subunit alpha and an selectin-like osteoblast derived protein. The number of CTLs in D. citri is comparable to the number found in bed bugs (11), pea aphids (6) and honeybees (10).

Galactoside-Binding Lectins
. CC-BY 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted January 9, 2017. . https://doi.org/10.1101/099168 doi: bioRxiv preprint A total of three Galactoside-Binding Lectins (galectins) and one partial galectin were identified and manually curated within the ACP genome (Supplementary Note 2). Galectins bind βgalactose with their structurally similar carbohydrate-recognition domains (45), which can function alone or in clusters creating a β-sandwich structure without Ca 2+ binding sites (46-48). Although we found more galectins in ACP than has been previously reported in the pea aphid (2 genes, (49)) and bed bug (1 gene, (39)), we did not observe a substantial lineage specific expansion as seen in dipterans.
Fibrinogen related proteins Similar to other hemipterans (pea aphid and bed bug), few fibrinogen related proteins (FREP) have been identified in ACP. Three complete FREPs were manually annotated (Scabrous, Angiopoietin and Tenascin) although partial un-annotatable FREP gene models were also detected. Like several of the other recognition molecule classes, the FREPs appear to have expanded in mosquitoes (50) (Supplementary Note 3). The suggestion that this expansion is related to blood feeding is consistent with the apparent absence in ACP of ficolin, tachylectins and aslectin, which are likely involved in detecting blood-borne parasites (50).

PGRP and βGRP
We identified one PGRP gene in D. citri. Insects have two classes of PGRPs: large (L) and small (S) (51). PRGP-L proteins recognize Gram-negative bacteria and activate the Imd pathway. PRGP-S genes interact with βGRPs such as GNBP to recognize components of Gram-positive bacteria and then activate the Toll pathway. Based on sequence similarity to other insect proteins, the D. ctiri PGRP protein seems to belong to the S class. We did not find any GNBP genes in D. citri (Supplementary Note 4). This is somewhat surprising, since GNBPs have been found in several hemipterans including pea aphids (49), bed bugs (39) and brown planthoppers (52).
Thioester containing proteins Only two Thioester containing proteins (TEP) were identified in ACP (Supplementary Note 5), which is comparable to the number found in Acrythosiphon pisum and Nasonia vitripennis. TEPs are members of an ancient protein family that includes vertebrate C complement and alpha-2-macroglobulin proteins (53). Insect TEPs seem to play a similar role to their vertebrate homologs, binding to invaders such as parasites or microbes, marking them for degradation, and they are upregulated by the JAK/STAT pathway during innate immune response (54).

Signaling cascades associated with pathogenesis
Once a potential infection has been detected, a cellular response is initiated by signaling cascade. Typically, Gram-positive bacteria and fungi cause activation of the Toll pathway, while the Imd pathway responds to Gram-negative bacteria (55,56). The JAK/STAT pathway plays a role in several immune functions, including antiviral defense (57). These signaling pathways are highly conserved, so the discovery that pea aphids (49), are missing many components of the Imd pathway was quite surprising. More recently, missing Imd pathway genes have been . CC-BY 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted January 9, 2017. . https://doi.org/10.1101/099168 doi: bioRxiv preprint reported in several other hemipterans(58-61), leading to speculation that association with Gram-negative endosymbionts may have favored the loss of these genes (61,62).

Toll Pathway
We identified four Toll receptors in D. citri (Supplementary Note 6). Comparison of the Toll receptors found in various insects suggests that there were six ancestral Toll receptors: Toll-1, Toll-6, Toll-2/7, Toll-8, Toll-9 and Toll-10 (63,64). Phylogenetic analysis indicates that the D. citri genes are orthologs of Toll-1, Toll-6, Toll-7 and Toll-8, but orthologs of Toll-9 and Toll-10 were not found. Pea aphids and bed bugs have Toll receptors from every class but Toll-9 (39,49). We found orthologs of five of the six Spätzle (Spz) ligand classes, including Spz1, Spz3, Spz4, Spz5 and Spz6 (Supplementary Note 7). The lack of Spz2 is not surprising since it has only been reported in Diptera and Hymenoptera. The downstream Toll pathway components are represented by single genes in most insects (65). Consistent with this, we identified single copy orthologs of tube, pelle, MyD88, TRAF6, cactus and dorsal (Supplementary Notes 8-13). Taken together, our findings suggest that the Toll pathway is largely conserved in D. citri, as it is in other insects.

Imd Pathway
As has been observed for several other hemipterans (49,58-61), many components of the Imd pathway appear to be missing in D. citri (Supplementary Note 14). We were unable to identify orthologs of Dredd, FADD, Imd, IKKG, and Relish. We did, however, find orthologs of pathway components IKKB, TAK1 and TAB, as well as FAF1/Caspar, a negative regulator of the pathway. Several Gram negative bacteria have been identified as D. citri symbionts, including Wolbachia, Candidatus Carsonella, Candidatus Profftella armaturae, and an as yet unidentified enteric bacteria closely related to Klebsiella variicola and Salmonella enterica (66-70). Given this information, it is tempting to speculate that the loss of many Imd pathway genes may, in fact, be associated with the ability of D. citri to acquire and harbor these Gram negative symbionts and might also be important for its ability to act as a carrier of CLas.
JAK/STAT Pathway The Janus kinase/signal transducer of activators of transcription (JAK/STAT) pathway is a signaling pathway that provides direct communication between the membrane and nucleus (71). We identified genes encoding the major components of the JAK/STAT pathway, namely the orthologs of domeless, hopscotch and marelle/STAT92E (Supplementary Note 15). The JAK/STAT pathway is involved in many developmental processes, in addition to its role in immunity, and has been found in all sequenced insects to date, including other hemipterans.

Response to pathogens and pathogen-associated stress
In response to infection, insect cells employ microbicidal compounds such as antimicrobial peptides (AMP), lysozymes and reactive oxygen species (ROS) to destroy invading cells and also activate tissue repair, wound healing and haematopoiesis processes. We searched the ACP genome for antimicrobial compounds (AMPs and lysozymes), the melanization-inducing . CC-BY 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted January 9, 2017. . https://doi.org/10.1101/099168 doi: bioRxiv preprint Clip-domain serine proteases (CLIP), the protective superoxide dismutases (SOD), and autophagy-related genes.
Antimicrobial peptides Although more than 250 antimicrobial peptides (AMPs) have been identified in insects, we searched the ACP genome for ten classes of known AMPs without success. The AMPs investigated included attacin, cecropin, defensin, diptericin, drosocin, drosomycin, gambicin, holotricin, metchnikowin and thaumatin. Although defensins are one of the most widely conserved, ancient groups of AMPs (72), the absence of defensin in ACP is not unprecedented as its absence has also been reported the hemipteran A. pisum (49). While the pea aphid is lacking defensin (as well as most other previously identified insect AMPs) it does contain six thaumatin (antifungal) homologs. Despite its presence in the closely related pea aphid, we were unable to identify thaumatin in the ACP genome. It must be noted that absence of previously identified AMPs does not necessarily suggest absence of all AMPs. AMPs are an extremely large, diverse group of molecules often defined by structure and function rather than conserved motifs, making identification through comparative sequence analysis difficult.

Lysozymes
Five genes encoding lysozymes were found in the D. citri genome (Supplementary Note 16). Lysozymes hydrolyze bacterial peptidoglycan, disrupting cell walls and causing cell lysis. Many insects produce lysozymes, particularly c-type lysozymes, and secrete them into the hemolymph following bacterial infection. C-type lysozymes that commonly defend against Gram-positive bacteria have been reported in many different insect orders including Diptera, Hemiptera, and Lepidoptera (73). Although c-type lysozymes were not found in the initial search of the ACP genome, two c-type lysozyme transcripts were found in D. citri MCOT v1.0 and were subsequently used to identify these genes in the ACP genome. Additionally, three i-type lysozymes were identified in the ACP assembly.

Superoxide dismutases
Insect hemocytes can produce a burst of reactive oxygen species (ROS) to kill pathogens (74). Since ROS are also damaging to host cells, superoxide dismutases are necessary to detoxify ROS. We found a total of four superoxide dismutase (SOD) genes in D. citri The copyright holder for this preprint (which was not this version posted January 9, 2017. . https://doi.org/10.1101/099168 doi: bioRxiv preprint Using the D. citri genome and the MCOT gene set, we identified D. citri orthologs of autophagyrelated genes known in Drosophila (Supplementary Note 19). Autophagy is the regulated breakdown of unnecessary or dysfunctional components of the cell. This process is highly conserved among all animals and is critical to the regulation of cell degradation and recycling of cellular components. The main pathway is macroautophagy, where specific cytoplasmic components are isolated from the remaining cell in a double-membraned vesicle called the autophagosome (80)(81)(82). We identified 17 out of 20 autophagy-related genes ( Supplementary  Note 19). There is only a single autophagy-related 8 gene in ACP gene sets compared to two for the Drosophila gene set, but this is common for non-dipteran insects (81). Thus, as expected, psyllids have the required repertoire of autophagy-related genes to undergo macroautophagy.
In summary, there is a reduction in the number of immune recognition, signaling and response genes in D. citri compared to insects from diptera. The reduction in the immunity genes is also observed in other hemipteran insect genomes such as A. pisum (49), P. humanus (58), Bactericerca cockerelli (59), R. prolixus (60) and B. tabaci (61). The reduction of immunityrelated genes in these insects has been attributed to association with their endosymbionts, which sometimes complement the immunity of the insects (83,84). In addition, this reduction in immune genes may be associated with insects that feed on nutritionally poor and relatively sterile food sources, such as blood and fluid from the xylem/phloem (58,83). However, bloodfeeding mosquitoes actually show an increase in immune genes and this expansion has been attributed to the likelihood of encountering pathogens in their food source. Arp et al. (22) pointed out additional inconsistencies with the diet hypothesis, including the presence of a full immune system in an insect species that develops in a sterile environment.

RNA interference pathway in D. citri
The RNA interference (RNAi) pathway is a highly conserved, complex method of endogenous gene regulation and viral control mediated through short interfering RNAs (siRNAs), microRNAs (miRNAs), and piwiRNAs (piRNAs). While all of these small RNA molecules function to modulate or silence gene expression, the method of gene silencing and the biogenesis differs (85). In Drosophila, it appears that genes in the RNAi machinery have subfunctionalized to have roles in specific small RNA silencing pathways (86)(87)(88)(89)(90)(91)(92). While the RNAi machinery genes have been shown to be conserved across major taxa, functional studies in insects have been limited to a handful of diptera. Investigating the complement of RNAi genes in D. citri may provide insight into the role that RNAi has on the immune response of phloem-feeding insects and could aid in better use of RNAi as a tool for pest management (93)(94)(95).
Core machinery Class II (Drosha type) and class III (Dicer type) RNase III enzymes play an essential role in the biogenesis of small RNA molecules with Drosha and Dicer1 functioning to produce miRNAs and Dicer2 functioning to produce siRNAs (96)(97)(98) dsRNA binding proteins act in concert with RNase III-type enzymes to bind and process precursor dsRNA molecules into small effector molecules (88,92,99,100). In some cases, these dsRNA binding proteins also function to load small RNA molecules into the RISC (101)(102)(103). In Drosophila, Pasha partners with Drosha (100), Loquacious (Loqs) partners with Dicer1 (92,99), and R2D2 partners with Dicer2 (88). In the D. citri genome, we identified two pasha homologs (Supplementary Note 22) and two loqs homologs but were initially unable to identify a true r2d2 ortholog (Supplementary Note 23) in the genome. The apparent absence of r2d2 was consistent with previous reports (22,104). However, a search of the MCOT (MCOT18647.0.CO) transcriptome identified a gene with similarity to R2D2 orthologs from bed bug, Tribolium castaneum and mosquitoes. While r2d2 is likely to be present in the D. citri genome, it is not annotatable given the limitations of the current assembly. Alternatively, if r2d2 is missing from in D.citri, it is possible that one of the Loqs proteins identified functions in the RNAi pathway (104), as Loqs has been shown to associate with Dicer2 in both Drosophila and Aedes aegypti (105,106).
Argonaute (AGO) proteins present small RNA guide molecules to their complementary targets through silencing complexes and provide the 'Slicer' catalytic activity that is required for mRNA cleavage in some RNA silencing pathways (107)(108)(109)(110). In Drosophila AGO1 is involved in the miRNA pathway, AGO2 is involved in silencing by siRNAs (86,91)  In summary, the D.citri genome has a full complement of RNAi machinery genes. Duplications are more frequent in genes that have previously been associated with the miRNA pathway (drosha, pasha and loqs) as opposed to the RNAi or piRNA pathways. This is an interesting . CC-BY 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted January 9, 2017. . https://doi.org/10.1101/099168 doi: bioRxiv preprint finding as the same result was found upon analysis of the pea aphid genome (112) but was not seen in the whitefly genome (61).

Building the foundation for P450/Halloween genes targeting to Reduce Insect Pests
Cytochrome P450s (CYPs) in eukaryotes are heme containing membrane bound enzymes that activate molecular oxygen via a mechanism involving a thiolate ligand to the heme iron. Usually this requires an electron donor protein, the NADPH cytochrome P450 reductase, in the ER or ferredoxin and ferredoxin reductase in the mitochondria (113,114). Insects have four deep branching clades on phylogenetic trees and this represents some losses during evolution as up to 11 clades are found in other animals. These are termed CYP2, CYP3, CYP4 and mitochondrial clans in P450 nomenclature (115). Most species have a tendency to expand P450s in one or more clans via tandem duplications. One interpretation of these P450 "blooms" is diversification to handle many related compounds from the environment that may be toxic or potential carbon sources.
D. citri in its current assembly has 60 P450 genes that are identified and named as distinct P450s. There are also numerous fragments named as partials. Diaphorina has a P450 bloom in the clusters CYP3172, CYP3174, CYP3175, CYP3176, CYP3178 in the CYP4 clan. There is another in the CYP3167 family in the CYP2 clan and a third that includes CYP6KA, CYP6KC and CYP6KD in the CYP3 clan. D. citri has 3 CYP4G genes.
CYP2 and mito clans have many 1:1 orthologs but these are rare in the CYP3 and CYP4 clan ( Figure 3). One exception in the CYP3 clan is CYP3087A1 and two neighbors CYP6DB1 and CYP6KB1 as they may be orthologs and probably should be in the same family. R. prolixus and A. pisum CYP3 clan genes have undergone gene blooms that were not found in D. citri. This may be interpreted as the common ancestor having few CYP3 clan P450s. The cluster at arc A ( Figure 3) on the tree consisting of CYP6KB1, CYP6DB1 and CYP3087A1 may be evidence that these three are orthologs and should be in the same family. The large aphid clade of CYP6CY at arc C has no members from Rhodnius or Diaphorina, so it seems to be aphid specific. At arc B ( Figure 3) the CYP395 family has four subfamilies C, D, E, F. There are many CYP395 genes in other hemiptera species, including C. lectularius (bedbug CYP395A,B), Apolygus lucorum (Hemiptera, a Mirid bug, CYP395G, H, J, K, L, M)), and Cyrtorhinus lividipennis (Hemiptera, green mirid bug, CYP395H, J, N). The fact they have been placed in different subfamilies suggests they are diverging from their common ancestor. The CYP3084 family with subfamilies A, B, C, D is only found in Rhodnius so far (116). The families CYP3088, CYP3089, CYP3090 and CYP3091 are also Hemiptera specific with some members in the same species noted earlier. The number of P450s varies with arthropod species from a low of 25 in the mite Aculops lyoperscii and 36 in Pediculus humanus (body louse) to over 200 in Ixodes scapularis (black-legged tick) and up to 158 in some mosquitos (117).
CONCLUSION . CC-BY 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted January 9, 2017. . https://doi.org/10.1101/099168 doi: bioRxiv preprint We report the first draft assembly for the D. citri genome and the corresponding official gene set (OGS v1.0) which includes 530 manually curated genes and about 20,000 genes predicted by the NCBI Eukaryotic Genome Annotation Pipeline (NCBI v100). The community curation effort involved undergraduate students at multiple locations who were trained, individually or in a class setting, in gene curation as a part of this initiative. These students were supported by contributions from expert annotators in the insect genomics community. The major advantage of having both expert curators and undergraduate students work together in the annotation project was the training, exchange of ideas and community building. We also present standard operating procedures that can be used to guide and coordinate annotation by large virtual teams. This community annotation process will be continued to improve structural and functional characterization of the ACP genome as new versions are generated. The MCOT transcriptome reported in this paper and also available at citrusgreening.org offers a genome-independent and comprehensive representation of the gene repertoire of D. citri that was used to improve the genome annotation. The MCOT transcriptome allowed us to identify lineage specific-gene models in D. citri and curate them. This gene set will support efforts in other hemipteran species that transmit bacterial pathogens.
In summary, we curated and described genes related to immunity and the RNAi pathway in addition to the cytochrome P450 genes. We report blooms in P450 genes in the CYP4, CYP2 and CYP3 clans which may be an evolutionary response to environmental stresses. Other important gene families that were curated as a part of the official gene set include aquaporins, cathepsins, cuticle and secretory proteins. We found the number of immunity-related genes to be reduced, even after direct targeting for improvement, in the D. citri genome similar to pea aphid and whitefly, which may reflect the association with microbial symbionts that have coevolved in both insects and the consumption of relatively sterile plant derived fluids. The genomic resources from this project will provide critical information underlying ACP biology that can be used to improve control of this pest.

DNA extraction and library preparation
High-molecular weight DNA was extracted using the BioRad AquaPure Genomic DNA isolation kit from fresh intact D. citri collected from a citrus grove in Ft. Pierce, FL and reared at the USDA, ARS, U.S. Horticultural Research Laboratory, Ft. Pierce, FL. To generate PacBio libraries, DNA was sheared using a Covaris g-Tube and SMRT-bell library was prepared using the 10Kb protocol (PacBio DNA template prep kit 2.0; 3-10Kb), cat #001-540-835.
Genome sequencing and assembly Samples were prepared for sequencing using the TruSeq DNA library preparation kits for paired-end as well as long-insert mate-pair libraries. All were sequenced on the Illumina HiSeq2000 using 100bp or longer reads. Seven libraries were sequenced, with inserts ranging from "short" (ca. 275bp) to 10Kb. These are available in NCBI SRA and included 99.7 million . CC-BY 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted January 9, 2017. . https://doi.org/10.1101/099168 doi: bioRxiv preprint paired-end reads (NCBI SRA:SRX057205), 35.1 million 2kb mate-pair reads (NCBI SRA: SRX057204), 30 million 5kb mate-pair reads (NCBI SRA: SRX058250) and 30 million 10kb mate-pair reads (NCBI SRA: SRX216330). A second round of DNA sequencing was done with PacBio at 12X coverage (NCBI SRA: SRX218985) for scaffolding the Diaci1.0 Illumina assembly to create the Diaci1.1 version of the D. citri genome. Thirty-nine SMRTcells of the library were sequenced, all with 2×45 minute movies. A total of 2,750,690 post-filter reads were generated, with an average of 70,530 reads per SMRTcell. The post-filter mean read length was 2,504 bp with an error rate of 15%.
Maker and NCBI annotation The maker control files are included in supplementary data. BLAST 2.2.27+ was used with augustus version 2.5.5 (121) and exonerate version 2.2.0 (122). Only contigs longer than 10kb were selected for annotation with psyllid specific transcriptome sequences from Reese et al. (20). Details of ACP genome annotation by the NCBI Eukaryotic Genome Annotation pipeline are available at https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Diaphorina_citri/100/.
Annotation edit distance We mapped transcripts from MCOT v1.0 to the genome using GMAP (123). Out of 30,562 transcripts, 19,744 were mapped with at least 90% query coverage and 90% identity.
A genome guided transcriptome was generated to validate all annotation sets using all available RNAseq data (Supplementary Table 4) and insect proteins (NCBI taxonomy:"Hexapoda [6960]") from Swiss-Prot were used as sources of evidence. 622 million paired-end reads were mapped to NCBI-Diaci1.1 assembly using hisat2 (124) with a mapping rate of 81.78%. Mapped files were sorted with samtools rocksort. After sorting, a genome-guided transcriptome assembly was performed using StringTie (125) and the resulting transcriptome contained 210,890 transcripts (N50: 1,691bp).
MCOT transcriptome assembly The MCOT v1.0 set was generated with the MCOT pipeline (26). Maker v1.1 gene models and RNAseq data from adult, nymph and egg tissue (20) were used to generate a genome-based transcriptome assembly using Cufflinks (29). Denovo transcriptome assemblies of the adult, nymph and egg RNAseq data were performed with Trinity (30) and Oases (31). These are available at ftp://ftp.citrusgreening.org/annotation/MCOT/. Transcripts from Maker (27,28), . CC-BY 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted January 9, 2017. . https://doi.org/10.1101/099168 doi: bioRxiv preprint Cufflinks (29), Trinity (30) and Oases (31) were translated to proteins with Transdecoder version 2.0.1 (126) and unique proteins were kept.
Reads were assembled with Trinity in two runs, one used reads as single end reads, and the other used them as paired end reads. Reads were trimmed based on fastq quality score (the -trimmomatic option was enabled and run under the default setting of Trinity). The transcripts of both runs were combined to make the final Trinity assembly.
Velvet-Oases (31) assemblies were performed for trimmed reads (trimmed in Trinity run, the read quality control step) from the egg, nymph and adult separately, with kmer length of 23, 25, 27 and 29 as single end reads, and kmer length 25 as paired end reads. The outputs from kmer 15 and kmer 27 were combined using the Oases merge function (--long and -min_trans_lgth 200) to generate the final assembly.
Reads from the egg, nymph and adult were first aligned to the genome with Tophat (127) with the insert length parameter based on each library (53, 24 and 90). The parameter --readrealign-edit-dist were set to 0 and -r 90 to ensure better alignment results. Gene models were generated by Cufflinks (128) with default settings, with the -frag-bias-correct and -multi-readcorrect function (-b, -u) enabled to give the most accurate gene models.
Furthermore, protein sequences from each program (Maker, Oases, Trinity, Cufflinks) were compared with BLASTP, with a special scoring matrix (matching score of non-identical amino acids setting to -100 of the BLOSUM62 matrix), and compared with proteins from other arthropod species by normal BLASTP alignment. The best protein models from each source were selected to make the final MCOT v1.0 protein set, and the corresponding transcript set. MCOT v1.0 set has 30,562 genes and is available at ftp://ftp.citrusgreening.org/annotation/MCOT/.
MCOT v1.0 was analyzed using Interproscan 5 (129) and InterPro database, which contains predictive information about protein function with the options -goterms to get GO terms,iprlookup to switch on look-up of corresponding InterPro annotation and -pa option to switch on lookup of corresponding Pathway annotation. We also performed BLAST analysis on the MCOT set using the NCBI nr, Swiss-Prot and the TrEMBL protein databases. These BLAST and InterproScan results were used as input for AHRD (Automated assignment of Human Readable Descriptions) (33), to assign descriptions, Pfam domains and Gene Ontology terms. This functional annotation was performed using a filter of bit score more than 50 and e-value less than e-10.
Community curation for structural and functional annotation of genes Gene lists and orthologs were made available via the project management website (https://basecamp.com/2923904/projects/9184795) so that the annotators could volunteer to analyze genes of interest. We used ImmunoDB (35) as a primary source of immune gene orthologs, which provided expert-curated immunity genes for Aedes aegypti, Anopheles gambiae, Drosophila melanogaster and Culex quinquefasciatus. Other closely related . CC-BY 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted January 9, 2017. . https://doi.org/10.1101/099168 doi: bioRxiv preprint organisms used as sources of annotated orthologs include bed bug (Cimex lectularius) (39), pea aphid (Acyrthosiphon pisum)(62) and milkweed bug (130). All communication was done via emails and the project management website, which was also used to store presentations, gene reports, meeting minutes, working documents and data files.
Candidate gene models were identified on the D. citri genome by using orthologous proteins as query in Apollo blat and i5k BLAST. The NCBI conserved domains database (131) was used to identify the conserved domains in the orthologs and candidate genes. Multiple sequence alignments were generated using MUSCLE (132), tcoffee (133) and clustal (134) to compare the ACP gene model to the query gene set. The final model was refined in Apollo using homology, RNAseq and proteomics evidence tracks. MEGA7 (135) was used to construct phylogenetic trees. Please see individual gene reports in supplementary notes 1-39 for detailed methods. When available, published literature was used to putatively assign molecular functions, participation in biological processes, and cellular localization for an annotated gene, associating term identifiers from the Gene Ontology Database (136) and PubMed identifiers from NCBI. Curated genes were assigned names and descriptions based on the function and domain structures available in published literature or NCBI.
Another strategy for identification of candidate genes involved preprocessing the query before searching the ACP genome on Apollo. A set of representative genes was BLASTN searched against a database of all contigs in the ACP genome. These contigs were BLASTX searched against all the database of named insect genes. Analysis of the match helped to identify missing exons and extend partial exons to achieve the best possible manually curated version of the ACP gene. Once gene models were reconstructed from the genomic DNA, they were located on the D. citri genome using the i5k BLAST server. The models were then manually edited based on evidence tracks to produce the final gene model.
We performed multiple cycles of internal review of curated gene models to identify errors and suggest improvements to annotators. Annotations were ranked a scale of A-D (Supplementary Table 5) during review depending on completeness and support from evidence tracks. Standard gene naming conventions were defined and agreed upon to ensure consistency. The curated gene models were exported from Apollo and validated using the i5k quality control pipeline (https://github.com/NAL-i5K/I5KNAL_OGS/wiki/QC-phase) which identifies intra-model, intermodel and single-feature errors.

46.
Leffler . CC-BY 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted January 9, 2017. . https://doi.org/10.1101/099168 doi: bioRxiv preprint
. CC-BY 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted January 9, 2017. . https://doi.org/10.1101/099168 doi: bioRxiv preprint