Genome sequence of the corn leaf aphid (Rhopalosiphum maidis Fitch)

Background The corn leaf aphid (Rhopalosiphum maidis Fitch) is the most economically damaging aphid pest on maize (Zea mays), one of the world’s most important grain crops. In addition to causing direct damage due to the removal of photoassimilates, R. maidis transmits several destructive maize viruses, including Maize yellow dwarf virus, Barley yellow dwarf virus, Sugarcane mosaic virus, and Cucumber mosaic virus. Findings A 326-Mb genome assembly of BTI-1, a parthenogenetically reproducing R. maidis clone, was generated with a combination of PacBio (208-fold coverage) and Illumina sequencing (80-fold coverage), which contains a total of 689 contigs with an N50 size of 9.0 Mb. The contigs were further clustered into four scaffolds using the Phase Genomics Hi-C interaction maps, consistent with the commonly observed 2n = 8 karyotype of R. maidis. Most of the assembled contigs (473 spanning 321 Mb) were successfully orientated in the four scaffolds. The R. maidis genome assembly captured the full length of 95.8% of the core eukaryotic genes, suggesting that it is highly complete. Repetitive sequences accounted for 21.2% of the assembly, and a total of 17,647 protein-coding genes were predicted in the R. maidis genome with integrated evidence from ab initio and homology-based gene predictions and transcriptome sequences generated with both PacBio and Illumina. An analysis of likely horizontally transferred genes identified two from bacteria, seven from fungi, two from protozoa, and nine from algae. Conclusions A high-quality R. maidis genome was assembled at the chromosome level. This genome sequence will enable further research related to ecological interactions, virus transmission, pesticide resistance, and other aspects of R. maidis biology. It also serves as a valuable resource for comparative investigation of other aphid species.

However, the mechanism of aphid resistance to these plant toxins is not known, and natural variation in benzoxazinoid content among maize inbred lines nevertheless influences growth and reproduction of R. maidis [6,7].
Damage caused to maize by R. maidis takes several forms, and the resulting yield losses can be quite variable from year to year. Growth and yield are reduced through the removal of photosynthates by large numbers of aphids [8]. On flowering-stage maize, aphids tend to congregate on the tassels, where large amounts of honeydew can prevent the release of pollen from the anthers, thereby reducing seed set by up to 90% [9,10]. Additional damage comes from the fact that R. maidis transmits several important maize viruses, including Maize yellow dwarf virus, Barley yellow dwarf virus, Sugarcane mosaic virus, and Cucumber mosaic virus [11][12][13][14][15].
In addition to feeding on maize, R. maidis also infests a variety of other monocot species, including barley, oat, rice, rye, sorghum, sugarcane, and wheat [4]. In one study, barley was reported as the most suitable grain crop host for R. maidis [16]. However, as in the case of maize, there is also considerable within-species variation for R. maidis resistance in barley [17].
The origin of R. maidis is likely in Asia, and it has been subsequently introduced in most grain-growing areas of the world. In almost all parts of its range, R. maidis is anholocyclic.
However, sexual reproduction has been reported in Pakistan and Korea, with Prunus ssp. as the primary host [18,19]. In populations in Japan and Kenya, males but not sexually reproducing females have been found [20,21]. Consistent with the sometimes permanently parthenogenetic life cycle of R. maidis, there is within-species variation in the chromosome numbers. Karyotypes of 2n = 8, 9, and 10 have been reported. There also is evidence of host specificity among the karyotypes. Whereas R. maidis strains on maize tend to have 2n = 8, those on barley generally have 2n = 10 [22,23].
Analysis of the assembled R. maidis genome identified horizontally transferred genes, repetitive elements, and likely xenobiotic detoxification enzymes.

Sampling and genome sequencing
Insect Colony. BTI-1, a corn leaf aphid (R. maidis) isolate, which was originally collected from maize (Z. mays) in New York State, was obtained from Stewart Gray (USDA Plant Soil and Nutrition Laboratory, Ithaca, NY). An isogenic colony was started from a single parthenogenetic female R. maidis and was maintained on barley (Hordeum vulgare) prior to the collection of insects for genome and transcriptome sequencing.
Genomic DNA was prepared from 100-200 mg of fresh R. maidis tissue using a previously described protocol [32]. Briefly, mixed-instar whole aphids were ground in liquid nitrogen and incubated at 65°C in microprep buffer made up of DNA extraction buffer (0.35M sorbitol, 0.1M Tris-base, pH7.5, 5mM ethylenediaminetetraacetic acid), nuclei lysis buffer (0.2M Tris-base, pH 7.5, 0.05M ethylenediaminetetraacetic acid, 2M NaCl, 2% cetyl trimethylammonium bromide), 5% sarkosyl and 0.5% sodium bisulfite for 30 min. This solution was then treated with chloroform:isoamyl alcohol (24:1) and centrifuged for 10 min at 14,000 g. The supernatant was treated with RNase A and DNA was pelleted by centrifugation at 4°C at 14,000 g for 10 min. DNA pellet was washed with 100% isopropanol and then with 70% ethanol and dissolved in 50 μl of nuclease free water. Around 50 μg of high molecular weight DNA was prepared for PacBio library construction and sequencing using SMRT Cell template preparation kits (Pacific Biosciences), and sequencing was conducted at the Icahn Institute and Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai. A total of 16 SMRT Cells were run on the PacBio Sequel platform, yielding 70 Gb raw sequence data (Supplemental Table S1), representing a 208-fold coverage of the R. maidis genome, which was estimated to be 338 Mb using the kmer approach ( [33]; Figure 1). For short-read sequencing, one paired-end library was constructed using the Illumina TruSeq DNA sample preparation kit following the manufacturer's instructions, and sequenced on an Illumina HiSeq 2500 system, which yielded about 75 Gb of raw sequence data (Supplemental Table S2). Raw Illumina reads were processed to remove duplicated read pairs, which were defined as having identical bases in the first 100 bp of both left and right reads, and only one read pair from each duplicated sequence was kept. Illumina adapters and low-quality sequences were removed from the reads using Trimmomatic [34]. The kmer depth distribution of the cleaned high-quality sequences displayed a single peak (Supplemental Figure   S1), indicating that the sequenced sample has a low level of heterozygosity.

Transcriptome sequencing
Transcriptome sequencing (Illumina strand-specific RNA-Seq and PacBio Iso-Seq) was conducted to aid gene prediction. Total RNA was extracted using the SV Total RNA isolation kit (Promega: Catalog number: Z3100). Briefly, cells were lysed by grinding 100-120 mg of insect tissue in liquid nitrogen, followed by incubation at 70°C in RNA lysis buffer (4M guanidine thiocyanate (GTC), 0.01M Tris, pH 7.5, 0.97% β-mercaptoethanol) for 3 min. This solution was then centrifuged for 10 min at 14,000 g and the supernatant was passed through a spin column provided with the kit, followed by DNase treatment. RNA was washed with RNA wash solution (60 mM potassium acetate, 10 mM Tris-HCl (pH 7.5), 60% ethanol) and dissolved in 50 μl of nucleasefree water. Strand-specific RNA-Seq libraries were constructed using a previously described protocol [35] and sequenced at Biotechnology Resource Center of Cornell University on an Illumina HiSeq 2500 sequencing system. More than 188 million paired-end reads with lengths of 151 bp were obtained (Supplemental Table S2). Raw reads were processed by trimming adaptor and low-quality sequences using Trimmomatic [34]. The cleaned reads were aligned to the assembled R. maidis genome using HISAT2 [36], followed by reference-guided assembly using StringTie [37]. The assembled transcripts were used to improve protein-coding gene predictions in the R. maidis genome.
For Iso-Seq, 20 μg RNA, isolated from 100-120 mg of fresh R. maidis tissue using the SV Total RNA isolation kit (Promega) with the method described above, was shipped to Duke Center for Genomic and Computational Biology for PacBio large-insert (15-20kb) library construction and sequencing using standard SMRTbell template preparation kits. One SMRT cell was run on the PacBio Sequel platform, yielding ~10 Gb raw sequence data (Supplemental Table S1). The PacBio raw reads were processed using IsoSeq3 (https://github.com/PacificBiosciences/IsoSeq3).
Briefly, one representative Circular Consensus Sequence (CCS) was generated for each zero-mode waveguide (ZMW). Only ZMWs with at least one full pass, meaning that each primer has been seen at least once, were used for the subsequent analysis. The CCSs were processed to remove the 5' and 3' primers, trim off polyA tails and remove artificial concatemers to create full-length, nonconcatemer (FLNC) reads. The FLNC reads were then clustered together. The final polishing step created a consensus sequence for each clustered transcript. A total of 21,114 high quality transcripts were generated, and were used to support protein-coding gene predictions in the R. maidis genome.

Hi-C library construction and sequencing
For Hi-C sequencing, 200 mg of R. maidis tissue was used for chromatin isolation and library preparation using the animal Hi-C kit from Phase Genomics (https://phasegenomics.com). Hi-C Libraries were sequenced at the Biotechnology Resource Center, Cornell University, using the NextSeq500 platform (Illumina) to obtain 76-nt paired-end reads. Raw reads were processed by trimming adaptor and low-quality sequences using Trimmomatic [34]. The cleaned Hi-C reads were aligned to the assembled contigs using BWA-aln [38], and the optimal placement of each read pair was determined by BWA-sampe [38]. Reads that did not map within 500 bp of a restriction enzyme site were removed using the PreprocessSAMs.pl script in LACHESIS [39].
Finally, only reads with mapping quality greater than 30 were used for scaffolding by LACHESIS [39].

Genome assembly
The PacBio long reads were corrected and assembled with the Canu assembler [40] (version 1.6).
The resulting contigs were polished by aligning the raw PacBio reads to the assembly, and correcting the sequencing errors using Arrow (https://github.com/PacificBiosciences/GenomicConsensus). To further improve the assembly, another round of polishing was performed by aligning the Illumina short reads to the assembly and correcting the sequencing errors using Pilon [41]. The assembled contigs were then compared against the NCBI non-redundant nucleotide (nt) database using BLASTN with an e-value cutoff of 1e-5. Contigs with over 90% of their lengths similar to only bacterial or viral sequences were considered to be contaminants and were discarded. The final contigs were clustered and ordered into chromosomes by Hi-C reads using LACHESIS [39] with default parameters. Scaffolds were manually polished using Juicebox [42].
The assembled R. maidis genome had a total length of 326.0 Mb and consisted of 689 contigs with an N50 length of 9.0 Mb. Thus, this is a much-improved genome assembly compared to the six previously published aphid genomes ( of the assembly) were clustered into four groups, which was consistent with the commonly observed 2n = 8 karyotype of R. maidis [22]. Of the clustered contigs, 473 spanning 320.6 Mb (98.4% of the assembly) were successfully orientated (Figure 1, Supplemental Figure 2). To evaluate the completeness of the R. maidis genome assembly, the Illumina paired-end library were aligned to the assembly, allowing up to three mismatches using BWA-MEM [38]. With this approach, 94.9% of the Illumina reads could be mapped back to the assembly, indicating that most of the reads were successfully assembled into the genome. RNA-Seq reads also were aligned to the genome assembly using HISAT2 [36], resulting a mapping ratio of 94.5% (Supplemental Table S2). Furthermore, the completeness of the genome assembly, as evaluated by BUSCO (v 3.0.2 [43], showed that 95.8% of the core eukaryotic genes were at least partially captured by the genome assembly and 94.5% were completely captured. Taken together, our evaluations indicated an overall high quality of the assembled R. maidis genome.

Endosymbiont genomes
The genome sequence of the Buchnera aphidicola endosymbiont was separated from the R. maidis host genome sequences by aligning the initial assembly to the Buchnera reference genome (GeneBank ID: NC_002528.1). One single contig was extracted and polished using both PacBio long reads and Illumina short reads, as described above. Genome annotation was performed using prokka [44]. The assembled BuchneraRm genome had a length of 642,929 bp (Supplemental Figure S3), with a total of 602 predicted protein-coding genes. The two Buchnera plasmids, pLeu and pTrp, were also sequenced and assembled, with lengths of 7,852 bp and 3,674 bp, respectively (Supplemental Figure S3).
To identify secondary bacterial symbionts in R. maidis, raw assembled contigs were compared against the reference sequences of previously identified secondary bacterial symbionts of aphids, including Hamiltonella defensa, Regiella insecticola, Serratia symbiotica, Rickettsia, Spiroplasma, X-type, Arsenophonus, and Wolbachia [45], using BLAST. No hits were found, suggesting that these secondary bacterial symbionts are not hosted by the sequenced R. maidis strain.

Annotation of repetitive elements
We first identified MITE (miniature inverted-repeat transposable elements) from the assembled R.
maidis genome using MITE-Hunter [46], and then generated a de novo repeat library by scanning the assembled genome using RepeatModeler (http://www.repeatmasker.org/RepeatModeler), which integrates results from RECON [47], TRF [48], and RepeatScout [49] and classifies repeats with the RepBase library [50]. RepeatModeler identified a total of 546 repeats. We subsequently compared these repeat sequences against the NCBI non-redundant (nr) protein database using BLAST with an e-value cutoff of 1e-5, and those having hits to known protein sequences were excluded. Finally, we identified repeat sequences by scanning the assembled R. maidis genome using the de novo repeat library with RepeatMasker (http://www.repeatmasker.org/) and the RepeatRunner subroutine (http://www.yandell-lab.org/software/repeatrunner.html) in the MAKER annotation pipeline [51]. A total of 21.2% of the assembled R. maidis genome was annotated as repeat elements ( Table 2). The most predominant repeat elements were unknown repeats and MITEs, which occupied 5.6% and 4.4% of the genome respectively.

Gene prediction
Protein-coding genes were predicted from the genome assembly of R. maidis using the automated pipeline MAKER [51]. MAKER integrates the results from ab initio gene predictions with experimental gene evidence to produce final consensus gene set. The evidence that was used included complete aphid coding sequences collected from NCBI, transcripts assembled from our strand-specific RNA-Seq data, high quality transcript sequences from Iso-Seq, completed proteomes of Acyrthosiphon pisum, Aphis glycines, Diuraphis noxia, Myzus cerasi, Myzus persicae, and Rhopalosiphum padi, and proteins from the Swiss-Prot database. All of these sequences were aligned to the R. maidis genome using Spaln [52]. MAKER was used to run a battery of trained gene predictors, including Augustus [53], BRAKER [54] and GeneMark-ET [55], and then integrated the experimental gene evidence to produce evidence-based predictions.
Altogether, 17,647 protein-coding genes were predicted in the R. maidis genome. The gene count of R. maidis is close to those in Ap. Glycines

Comparative genomics
We compared the R. maidis genes with those of six other aphid species (Ap. glycines, M. persicae, Ac. pisum, M. cerasi, R. padi, and D. noxia), as well as the whitefly (Bemisia tabaci) [24][25][26][27][28][29][30][31]. The proteome sequences of all eight species were used to construct orthologous groups using OrthoMCL [58]. A total of 5,696 orthologous groups were shared by all 16 species, including 3,605 single-copy orthologous genes. These single-copy genes were used to reconstruct their phylogenetic relationships. Briefly, protein sequences of the single-copy genes were aligned with MUSCLE [59], and positions in the alignment containing gaps in more than 20% of the sequences were removed by trimAl [60]. A phylogenetic tree was then constructed using the Maximum-Likelihood method implemented in PhyML [61], with the JTT model for amino acid substitutions and the aLRT method for branch support. B. tabaci was used as the outgroup in the phylogenetic tree, which showed that R. maidis is close to R. padi, and separated from A. pisum and M. persicae (Figure 2), consistent with a phylogeny that was derived using mtCOI [62].

Identification of horizontal gene transfers
All of the R. maidis predicted gene models were compared against six protein databases derived from complete proteomes in UniProt, including those from bacteria, archaea, fungi, plants, metazoa (excluding proteins from other species in the Arthropoda), and other eukaryotes, using BLASTP. The index of horizontal gene transfer (HGT), h, was calculated by subtracting the bitscore of the best metazoan match from that of the best non-metazoan match [63].We required that these sequences were aligned better to the other five taxa than to the metazoan database, defining HGT candidates as those with h ≥30 and a best non-metazoan hit bitscore ≥100. The corresponding genome sequences of these candidates as well as 1000-bp flanking sequences at both ends were manually checked for potential genome assembly errors, and none were found.
We phylogenetically validated all HGT candidates. Their protein sequences were compared against the protein databases of six taxa (archaea, bacteria, fungi, plants, metazoan, and other eukaryotes) using BLASTP. The top five hits from each taxon were extracted, and aligned with the candidate HGT protein using ClustalW2 [64]. Each alignment was trimmed to exclude regions where gaps were more than 20% of sequences. Phylogenetic trees were constructed using PhyML  Table 3). The two bacterial genes were previously identified as horizontally transferred into Ac. pisum [65], and expression silencing of one of these genes, a bacteriocyteexpressed LD-carboxypeptidase A, was shown to reduce aphid performance [66]. A cluster of genes encoding multiple enzymes for carotenoid biosynthesis, which were horizontally transferred into the Ac. pisum genome from fungi [67], is also present in the R. maidis genome. Two R. maidis genes that cluster together with genes from trypanosomes and other protozoa have not been previously reported as horizontally transferred in aphids. Finally, nine genes encoding proteins containing ankyrin repeat domains show highest similarity to genes from unicellular algae in the genus Ostreococcus.

Detoxification and insecticide resistance
Cytochrome P450s, glutathione S-transferases (GSTs), carboxylesterases, UDPglucosyltransferases (UGTs), and ABC transporters function in the avoidance and/or detoxification of plant defensive metabolites [68,69], and insecticide resistance [70,71]. We identified such detoxification-related genes in R. maidis based on protein domains that were predicted through InterProScan [72].  Table 4 and Supplemental Table S3), consistent with R. maidis being a specialist monocot herbivore that may require a smaller repertoire of detoxification enzymes. Although the detoxification gene count in Ac. pisum was high, the average lengths of the protein sequences were shorter than those in R. maidis, Ap. glycines, and M. persicae (Supplemental Figure S4), suggesting that these genes could be incomplete or pseudogenes in Ac. Pisum, possibly due to a lower-quality genome assembly.

Conclusion
As the currently most complete aphid genome, our R. maidis assembly will provide a valuable resource for comparisons with other species and the investigation of aphid genome evolution.
Research on the ecological interactions of R. maidis, including host plant choices, detoxification of secondary metabolites, and gene expression responses, will be facilitated by the R, maidis genome sequence. Practical applications in agriculture may include the identification of virus transmission mechanisms and new targets for chemical pest control. . Figure S2. Hi-C contact map of the R. maidis genome . Figure S3. Circular view of the genome of the Rhopalosiphum maidis endosymbiont, Buchnera aphidicola (A) and its plasmids pLeu (B) and pTrp (C).

Ethics approval and consent to participate
This is not required for experiments with R. maidis.