Portiera Gets Wild: Genome Instability Provides Insights into the Evolution of Both Whiteflies and Their Endosymbionts

Abstract Whiteflies (Hemiptera: Sternorrhyncha: Aleyrodidae) are a superfamily of small phloem-feeding insects. They rely on their primary endosymbionts "Candidatus Portiera aleyrodidarum" to produce essential amino acids not present in their diet. Portiera has been codiverging with whiteflies since their origin and therefore reflects its host’s evolutionary history. Like in most primary endosymbionts, the genome of Portiera stays stable across the Aleyrodidae superfamily after millions of years of codivergence. However, Portiera of the whitefly Bemisia tabaci has lost the ancestral genome order, reflecting a rare event in the endosymbiont evolution: the appearance of genome instability. To gain a better understanding of Portiera genome evolution, identify the time point in which genome instability appeared and contribute to the reconstruction of whitefly phylogeny, we developed a new phylogenetic framework. It targeted five Portiera genes and determined the presence of the DNA polymerase proofreading subunit (dnaQ) gene, previously associated with genome instability, and two alternative gene rearrangements. Our results indicated that Portiera gene sequences provide a robust tool for studying intergenera phylogenetic relationships in whiteflies. Using these new framework, we found that whitefly species from the Singhiella, Aleurolobus, and Bemisia genera form a monophyletic tribe, the Aleurolobini, and that their Portiera exhibit genome instability. This instability likely arose once in the common ancestor of the Aleurolobini tribe (at least 70 Ma), drawing a link between the appearance of genome instability in Portiera and the switch from multibacteriocyte to a single-bacteriocyte mode of inheritance in this tribe.

For each whitefly species collected, the 5' region of the mtCOI gene was amplified using the universal primers LCO1490 (F) and HCO2198 (R) (Folmer et al., 1994). In cases were this set of primers failed to amplify, the C1-J-2195 (F) and L2-N-3014 (R) primer set targeting the 3' region was used (Frohlich et al., 1999). In species were both sets of primers failed to amplify and mtCOI sequences were available at public databases, species-specific sets were designed (Table S2). For PCR amplification, primers (0.5 mM each) were mixed with the KAPA2G Robust HotStart ReadyMix (Kapa Biosystems) inside a DNA/RNA UV-Cleaner cabinet (UVC/T-AR). PCR was performed using the following general profile: 95 • C for 5 min,[95 • C for 30 sec, Tm • C for 15 sec, 72 • C for 1 min]x35, 72 • C for 5 min. Annealing temperature (Tm) was set up for each primer set according to Primer3 predictions (Table S2). When required, the temperature was adjusted trying 5 • C above or below of the predicted Tm.
The primers sets targeting the mtCOI gene had lower efficiency and produced the expected PCR fragments in 18 of the 26 samples. The LCO1490/HCO2198 set produced a PCR fragment in 15 samples but two of them were parasitoid wasp sequences instead of whitefly sequences. The C1-J-2195/L2-N-3014 produced a PCR fragment in one sample. The other four successful amplifications used species-specific mtCOI primers. The mtCOI amplicon presented saturation values that were higher than those calculated for the five amplified Portiera genes which did not show signatures of substitution saturation in their phylogenetic signal (Table S3). Moreover, the third codon positions of the mtCOI full gene and amplicon were found to be completely saturated.

Portiera lineages divergence dating
BEAST v2.5.2 was used to infer a Bayesian posterior consensus tree and the divergence time of the different nodes (Bouckaert et al., 2014). Evolutionary models for each of the pruned alignments were calculated with IQTREE (-m TESTONLY) (Nguyen et al., 2015). The best model for each alignment was selected based on the Bayesian Information Criterion. BEAUti was used to prepare the requited xml files using the pruned alignments and the obtained evolutionary models. Datasets were partitioned by gene. A lognormal relaxed clock with a Yule speciation process was selected as a prior based on a previous work (Santos-Garcia et al., 2015). The divergence between the subfamilies Aleyrodinae and Aleurodicinae (125-135 Mya) (Drohojowska and Szwedo, 2015) was used as a calibration point using a uniform distribution. Both datasets were first run under the priors to check their possible impact on the estimated dates. Then, four independent runs were conducted for 500 million generations, sampling every 50,000 generations and allowing a pre-burn-in of 1,000 generations. Convergence of the different runs was checked with Tracer v1.6, ensuring that at least an ESS larger than 200 was accomplished for each parameter estimated in each run. Log and Tree files were combined and trimmed with LogCombiner v2.5.2. Combined Tree files were used as input for TreeAnnotator v2.5.2 to obtain the topologies of the trees and their associated parameters, including divergence times and their confidence intervals. Trees were plotted with FigTree v1.4.3 (https://github.com/rambaut/figtree).

Portiera lineages molecular evolution
Synonymous (dS) and non-synonymous (dN ) substitution ratios and omega (ω) values were calculated in Codeml from PAML v4.7 package (Yang, 2007). To avoid redundancy, only one representative of the B. tabaci lineage, the MEAM1 species, was included in the analysis. After quality filtering, 232 of the 235 identified single-copy core OCPs were codon-aligned with MACSE v2.03 as described above and pruned with Gblocks v0.91b (-t=c -b5=n, no gaps allowed). Three branch models available in Codeml were computed: M0 (one ω), M1 (free ω ratios in each branch) and M2 (two ω, setting Portiera from B. tabaci and S. simplex as the foreground clade). All models were run using a constrained tree option that utilized the species tree obtained with OrthoFinder v2.3.3. Each model was run three times, keeping only the run with the largest likelihood. Likelihood ratio tests (LTR) were used to select the best fitting model, first by comparing model M1 against M2, and later against model M0. Bonferroni correction (two tests, p-value <= 0.025) was used for multiple testing. The pipeline described was implemented in a custom python script. Divergence times were obtained from the tree that was based on five Portiera genes ( Figure 1). These estimates were used to calculate the number of nucleotide substitutions per site per year (dS/t and dN/t) in each whitefly lineage. Before any statistical test, OCPs with dS, dN or ω values below percentile 1 or above percentile 99 were discarded. Also, OCPs with omega values greater than 10 were discarded. Raw and log-transformed data were tested for normality and homoscedasticity (Shapiro's and Levene's test, respectively). As dS/t, dN/t, and ω values were not normally distributed or homoscedastic, the nonparametric Kruskal-Wallis rank-sum and post-hoc pairwise Wilcoxon rank sum tests (with Benjamini & Hochberg correction) were used. All statistical analyses were performed in R.

Mitochondrion assembly, annotation, and molecular evolution
For accurate assembly of the mitochondrial genomes of S. simplex and P. mori, reads classified as "Insect" by Kraken2 were extracted and assembled with SPAdes v3.13.0 (-sc -careful). mtCOI s fragments obtained by PCR screening were used as a query for a BLASTN search against the assembly. Two contigs larger than 15Kb containing a full mtCOI gene were recovered. The first contig presented a full mtCOI gene with a nucleotide identity higher than 97% to P. mori (LR739216). The second contig harboring a full mtCOI gene had Acaudaleyrodes rachipora (LR739211) and B. reyesi (LR739074) as the two best hits. Further confirmation was conducted with the BOLD Identification System (Ratnasingham and Hebert, 2013). The first mtCOI full gene was classified again as P. mori (98.97% similarity to BOLD record GMESH030-14). The second mtCOI full gene was classified as S. simplex (100% similarity to BOLD record GBMHH7908-15). Both contigs were circularized with Gap5 and corrected with Pilon v1.23 (Walker et al., 2014) as described above. Mitochondrial genomes annotation was performed with MITOS v2 web server (http://mitos2.bioinf.uni-leipzig.de) setting the genetic code to Invertebrate. Annotations were downloaded and manually revised in Artemis v16 (Rutherford et al., 2000).
A. dispersus (KR063274), T. vaporariorum (NC 006280), and B. tabaci MED (MH205752) mitogenomes were downloaded from NCBI. Full genes were extracted from the downloaded and the newly assembled P. mori and S. simplex mitogenomes. Due to annotation differences, only 12 mitochondrial genes were shared between the different genomes. Extracted genes were codon-aligned with MACSE v2.03 (genetic code set up to 5) and pruned with Gblocks v0.91b. Saturation of the phylogenetic signal was assessed with DAMBE v7.2.3 as described above. Finally, pruned alignments were used as an input for Codeml to calculate dS, dN , and ω values. Since only one Aleurodicus species was included, the divergence of this lineage was set up to 129.35 Myr (the estimated time for the split between the Aleurodicinae and the Aleyrodinae families). Divergence times of the other four lineages were the same as described in the main text document. Statistical analyses and data cleaning were conducted as described above but this time the data were normally distributed and one-way ANOVA with Tukey's post-hoc tests were applied when required.  Figure S1: DNA polymerase III proofreading subunit (dnaQ) presence or absence using the two Portiera genes-based tree (the 16S and 23S rRNA genes). The ancestral state for each node was estimated using the ace function form the ape package (Paradis and Schliep, 2019) in R (R Core Team, 2018). Pie charts at the nodes represent the posterior probability for the presence (blue), absence (red), or unknown state (yellow). Filled circles at the tips represent dnaQ successful amplifications (blue), no amplification (red), or no data/untested (yellow).