Torix Rickettsia are widespread in arthropods and reflect a neglected symbiosis

Abstract Background Rickettsia are intracellular bacteria best known as the causative agents of human and animal diseases. Although these medically important Rickettsia are often transmitted via haematophagous arthropods, other Rickettsia, such as those in the Torix group, appear to reside exclusively in invertebrates and protists with no secondary vertebrate host. Importantly, little is known about the diversity or host range of Torix group Rickettsia. Results This study describes the serendipitous discovery of Rickettsia amplicons in the Barcode of Life Data System (BOLD), a sequence database specifically designed for the curation of mitochondrial DNA barcodes. Of 184,585 barcode sequences analysed, Rickettsia is observed in ∼0.41% of barcode submissions and is more likely to be found than Wolbachia (0.17%). The Torix group of Rickettsia are shown to account for 95% of all unintended amplifications from the genus. A further targeted PCR screen of 1,612 individuals from 169 terrestrial and aquatic invertebrate species identified mostly Torix strains and supports the “aquatic hot spot” hypothesis for Torix infection. Furthermore, the analysis of 1,341 SRA deposits indicates that Torix infections represent a significant proportion of all Rickettsia symbioses found in arthropod genome projects. Conclusions This study supports a previous hypothesis that suggests that Torix Rickettsia are overrepresented in aquatic insects. In addition, multiple methods reveal further putative hot spots of Torix Rickettsia infection, including in phloem-feeding bugs, parasitoid wasps, spiders, and vectors of disease. The unknown host effects and transmission strategies of these endosymbionts make these newly discovered associations important to inform future directions of investigation involving the understudied Torix Rickettsia.

The use of 'N' characters at double-peak sites could lead to potential problems in the interpretation of these 10 taxa at terminal branches of the phylogeny but the placement as Torix Rickettsia is not likely to be affected. Furthermore, these double-infections are a minority of the total taxa meaning their effects on interpreting results are likely to be minimal.
"Line 254-261: I think you need to add in a correction for multiple testing. You do at least two tests that are in the main text, but it sounds from the response to reviewer's comments that you did many more than that and are only reporting the ones that are P<0.05. However, you should report how many tests you did to find those two and adjust for multiple testing. Otherwise, if you do 20 tests, you would expect to have 1 that is "significant" (for more information on multiple testing: http://www.biostathandbook.com/multiplecomparisons.html). In addition, I think it is important for people to know what comparisons were done that were not significant as these are also results. Addressing multiple testing seems like an issue throughout. In the methods other statistical tests were clearly undertaken where there was multiple testing." Two Fisher's exact tests (aquatic vs terrestrial insects-1 controlled for insect order and 1 uncontrolled) were detailed in the main text and additional file 7, as these were the only taxonomically 'matched' pairs. However, one additional test was performed initially to compare terrestrial vs aquatic invertebrates in general which did not give a significant p-value due to a hotspot of Rickettsia in spiders, which are known to be a hotspot for all inherited symbionts tested to date ( . This detail has now been added in Additional file 7 and lines 266-269. Overall, only 3 tests were done (2 significant and 1 not significant) and this indicates that Torix Rickettsia are over-represented in aquatic insects but this may not be the case for invertebrates in general.
"Line 354: Several, maybe many, of the Wolbachia integrations have no mutations or frameshifts, particularly in insects. Those with frameshifts and mutations are easier to find and identify as integrations such that the number of integrations without frameshifts and mutations is likely an underestimate, particularly given how many groups are still screening Wolbachia sequences out before assembling insect genomes. I have no idea how often that happens with Rickettsia, but it seems like, particularly as more groups use tools like blobplots." We thank the reviewer for raising this issue. We have now put a caveat at the end of this sentence to indicate that despite no frameshifts or mutations, it is still possible the sequences from this study are host integrations (lines 376-378). The problem is likely to be less for Rickettsia than Wolbachia, due to differences in the mode of vertical transmission. Wolbachia is present in the germline stem cell niche, such DNA from the symbiont is available for incorporation into the germline. Rickettsia, in contrasts, usually invades the egg after meiosis, through the follicular epithelium. Thus, Rickettsia DNA is much less present in the germline of insects, making integration less likely.
"Line 366-379: This section still has issues with respect to the study design being secondary data analysis. These lines are in the discussion, it is the time to say things like on line 377 that the over-representation here in BOLD data (if that is the data you are referring to, because I can't remember which one was 17/19 and there is not here that clarifies) could be the result of an amplification bias-in not producing the host copy of the gene, amplifying the Rickettsia gene, or both. Those issues are profound in secondary data usage and need to be addressed head-on so that others who read the paper do not misconstrue the results. Likewise, the SRA data is not random, so I am not sure the statement on line 379-381 is correct, and at very least it needs qualifications. If it is correct, you need to better argue in the manuscript why it is correct, like that you used a sampling scheme to reduce bias, or something like that. Personally, I think it is better to acknowledge the limitations that try to justify, as even if you have a sampling scheme, it can be biased. The PCR screen listed in the table of this manuscripts seems biased from my quick look (e.g. an over-representation of mosquitoes)." The "17/19 strains" being Torix is a reference to the targeted screen (not the BOLD screen) which was used alongside the BOLD data because of the aforementioned biases relating to amplification bias and this has now been clarified on lines 401-402. Additionally, we have added a sentence to the results section explicitly quoting the 17 strains of Rickettsia found in the targeted screen (lines 248-250). Although 95% of Rickettsia amplifications from BOLD are Torix, we already mention that this is likely due to primer bias (lines 321-324). Subsequently, the targeted screen is used in part to negate the problems of relying entirely on secondary data. Of course, many studies which aim to investigate the distribution of a symbiont will have sampling and methodological biases. However, having multiple screening strategies, as we have here, is likely to give a more nuanced and holistic view of Torix Rickettsia ecology. We believe that the combined use of several screening methods is a strength and not a weakness of the study. Despite this, we have now added a separate section detailing the limitations of the study (lines 358-388).
Specifically, regarding lines 379-381 (of the 1st revision), this statement is based not just on SRA data but also the targeted screen from this study and Weinert's study as mentioned in the previous lines. Thus, the SRA is corroborating two separate targeted screens (one which lacked spiders and aquatic insects demonstrating a high number of Belli infections, and another which included spiders and aquatic insects demonstrating a high number of Torix infections.). Subsequently, for clarity we have now changed the statement "Our additional use of a bioinformatics approach based on the SRA appears to confirm that Belli and Torix are two of the most common Rickettsia groups among arthropods." to "Our additional use of a bioinformatics approach based on the SRA appears to corroborate targeted screen data indicating that Belli and Torix are two of the most common Rickettsia groups among arthropods." (lines 403-406).
"Line 387-388: please provide more details. I don't remember reading that. Pointing to an exact result, for instance of how many strains of the same MLST type are in different insect orders is necessary. It should have been in the results if it is in the discussion. In fact, if I look at the figures, the Wolbachia in Figure 2 actually seem to be grouped by insect host taxa at this level. The same is ture for Figure 3 for the Rickettsia. There are a few interleaved colors, but without knowing more I'm not convinced that it can't be explained in another way (like a mite on a host or in the gut from a carnivorous insect or even a double infection); I also can't tell which ones are identical and which ones are just similar. But even if I should infer it from the figures, it should be reported in the results and I didn't find it there. Maybe you are trying to state it in the subsequent sentence if one assumes that all blood feeders are the same taxa and all phloem-feeders are the same taxa, but that isn't clear. (And at least blood feeding is a trait found in multiple diverse taxa)." The inferences related to similar strains in distantly-related hosts is best observed in the multigene tree in figure 4 rather than the single gene trees of figures 2 and 3. For example, odonate strains are clearly interleaved between strains from other host orders. More specifically, the two Coenagrion strains have 100% identity to the Culicoides stigma strain in contrast to two other odoante (Polythore) strains where multiple SNPs are observed at all loci (See ftp file 'BOLD_multigene_Rickettsia_alignment.fas'). We thank the reviewer as this was not mentioned in the results but we have now included this on lines 209-211. Furthermore, regardless of exact MLST profiles for strains, taxa from most orders are represented in both Limoniae and Leech Torix subclades indicating a lack of grouping based on insect host taxa. The authors believe this concept is better represented in a phylogeny rather than a list of MLST profiles.
"And once again, I'm left wondering if there is a sampling bias. Are mosquitoes overrepresented in the database? It some of the tables they seem over-sampled. Blood feeders and phloem feeders are often well sampled, given their important to human health and agriculture, respectively. But maybe more problematically, these results are being described but they are not clearly described in the results section. If I search for blood, I do not find any results that support this statement. When I search for phloem, there is a mention of them being found in phloem-feeding insects, but not that they are diverse, and I have no way of assessing that as a reviewer or reader. Yet, this result for phloem-feeding is also in the abstract as a taxa that is a hot-spot. I don't see it in the figures in a way that I understand (e.g. phloem-feeding isn't annotated). Additionally, there is no assessment here that convinces me it is a hot spot; there are no statistics to suggest it is overabundant, which would be required to be a hotspot (and any such statistics would need correction for multiple testing or some sort of FDR calculation)." Mosquitoes are likely to be over-represented in the sequence read archive but as mentioned already in lines 142-144, a single dataset per species was extracted for analysis to negate oversampling of the same species. Although certain genera may still be oversampled, the only instance of mosquito Rickettsia being detected is in the Anopheles plumbeus population of the targeted screen. With regards to phloemfeeding insect strains, psyllids and other phloem-feeders are present in both Limoniae and Leech subclades suggesting again that strains are diverse within similar lifestyles. This is best seen in Figure 6 where both phloem-feeding and blood-feeding are annotated. The common patterns of infection in phloem-feeding bugs and bloodfeeders are also already mentioned in lines 296-302 of the results. We agree that the common patterns or 'hot-spots' found in our data should come with caveats and we have now included this in the limitations part of the discussion where we clarify that common patterns of infection refer specifically to our datasets which although extensive, have some biases and may not completely represent Torix Rickettsia infection in nature (lines 358-371). ""as previously described" Is this in a different manuscript, or earlier in the manuscript? I suspect this means earlier in the paper, but it needs to be clear. Was the same alignment method used with MAFFT and Gblocks? Same for ModelFinder? If it is and all ML trees were inferred the same way, I would recommend that you have a methods section that describes this one for all alignments, maybe concluding with what is different (like the model)." This refers to methods described earlier in the manuscript and yes, the same methods were used for all ML trees. This suggestion is welcomed and has been included in lines 492-493.
"I've outlined some examples where the statements don't reflect what is presented in the results and the limitations of a secondary data analysis. But they are actually more numerous and pervasive than this. For instance, line 39-41 and 41-43 in the abstract have these issues. Likewise, Line 161-162 should read "Torix Rickettsia is the most common bacterial contaminant sequence currently in BOLD, a major barcoding project". This change reflects that this only holds for what has been barcoded thus far, and the issues with the fact you need both failed host amplification and successful bacterial amplification, and that the biodiversity represented in such projects have their own biases. It is so pervasive, I am not sure I found all the instances. Honestly, I think the paper would really benefit from a large clearly labeled section that more explicitly deals with all the limitations of the study, so others do not misconstrue the results for years into the future. It would make the paper much stronger and definitely more rigorous." With regard to line 39-41 describing how our targeted PCR data supports the aquatic hotspot hypothesis we refer the reviewer to our response to the 'multiple testing' above. For lines 41-43, we have changed this sentence to include the caveat that this applies only to arthropod genome projects: "Furthermore, the analysis of 1,341 Sequence Read Archive (SRA) deposits indicates Torix infections represent a significant proportion of all Rickettsia symbioses found in arthropod genome projects." We have also changed lines 161-162 to reflect a similar caveat for the BOLD data as suggested by the reviewer. As previously mentioned, we have now included a specific section in the discussion detailing the limitations of our datasets (lines 358-388).  It is now widely recognized that animals live in a microbial world, and that many aspects of 57 animal biology, ecology and evolution are a product of their symbioses with microorganisms 58 [1]. In invertebrates, these symbioses may be particularly intimate, and involve transmission 59 of the microbe from parent to offspring [2]. The alignment of host reproduction with symbiont 60 transmission produces a correlation between the fitness interests of the parties, reflected in 61 symbionts evolving to play a number of physiological roles within the host, from defence [3,4] 62 through to core anabolic and digestive functions [5,6]. However, the maternal inheritance of 63 these microbes has led to the retention of parasitic phenotypes associated with distortion of 64 reproduction, with symbiont phenotypes including biases towards daughter production and 65 cytoplasmic incompatibility [7]. These diverse individual impacts alter the ecology and 66 evolution of the host, in terms of diet, dynamics of interaction with natural enemies, sexual 67 selection and speciation.  In this paper, we use three approaches to reveal the diversity and commonness of Torix forest detritivores (e.g. Sciaridae; Mycetophilidae; Staphylinidae) and phloem-feeding bugs 305 (Psyllidae; Ricaniidae). Feeding habits such as phloem-feeding, predation, detritivory or 306 haematophagy were not correlated with any particular Torix Rickettsia subclade ( Figure 6). 307 Furthermore, parasitoid and aquatic lifestyles were seen across the phylogeny. All newly 308 discovered putative Torix Rickettsia host taxa are described in Table 2  (compared to the targeted screen which contains multiple aquatic insect species) could be 369 due to this sampling bias. Likewise, the over-representation of Torix Rickettsia from BOLD is 370 likely due to an amplification bias as a result of higher primer site homology to that particular 371 group from commonly used barcoding primer sets. Subsequently, the common patterns of 372 infection (or 'hot spots') found in this study are identified as such with these provisos in mind. 373 To counteract these biases and to give a more nuanced and holistic view of Torix Rickettsia 374 ecology, a targeted screen was also included to ensure this study was not over-reliant on 375 secondary data. Model-based estimation techniques suggest Rickettsia are present in between 20-42% of 395 terrestrial arthropod species [12]. However, the targeted PCR screen in this study gave an 396 estimated species prevalence of 8.9% for terrestrial species. This discrepancy is likely due to 397 targeted screens often underestimating the incidence of symbiont hosts due to various 398 methodological biases including small within-species sample sizes (missing low-prevalence 399 infections) [29]. Importantly, the inclusion and exclusion of specific ecological niches can also Closely related Rickettsia are also found in putative hosts from different niches and habitats -419 for instance, the Rickettsia strains found in terrestrial blood feeders do not lie in a single clade, 420 but rather are allied to strains found in non-blood feeding host species. Likewise, strains in 421 phloem-feeding insects are diverse rather than commonly shared.   Aside from phylogenetic placement of these Rickettsia-containing samples, attempts were 535 made to extract an mtDNA barcode from these taxa in order to identify the hosts of infected 536 specimens. This is because morphological taxonomic classification of specimens in BOLD is 537 usually only down to the order level before barcoding takes place. Previous non-target 538 amplification of Rickettsia through DNA barcoding of arthropod DNA extracts had occurred in 539 the bed bug Cimex lectularius, with a recovery of the true barcode after using the primer set 540 C1-J-1718/HCO1490, which amplifies a shortened 455 bp sequence within the COI locus. 2) Genus level designation for at least 95% sequence identity. 560 3) Family level designation for at least 85% sequence identity. 561 Additionally, all sequences were required to be at least >200 bp in length. 562 563

Assessment of barcoding success 564
One of the factors determining a successful COI bacterial amplification is the initial failure of 565 an extract to amplify mtDNA. Subsequently, to determine the likelihood of this event within 566 taxa, we used the 55,366 specimen representative data subset [37] to evaluate failure rates. 567 To this end, all orders of host which gave at least one non-target Rickettsia COI hit were 568 assessed. The barcoding success rate was determined as the proportion of specimens which 569 matched initial morphotaxa assignment and were not removed after BOLD quality control 570 [38]. As the total Rickettsia count was from a larger dataset than the one made available, an 571 adjusted infection frequency for each taxon was calculated based on the representative data 572 subset. It is known that there are taxonomic hot spots for endosymbiont infection, with for instance 589 spiders being a hot spot for a range of microbial symbionts [43]. Therefore, analyses were 590 performed that were matched at a taxonomic level (i.e. each taxon was represented in both 591 the aquatic and terrestrial pools). To this end, the incidence of Torix Rickettsia was first 592 compared in all insects. However, within insects, there is taxon heterogeneity between 593 aquatic and terrestrial biomes (e.g. Ephemeroptera, Plecoptera in aquatic only, Lepidoptera 594 in terrestrial only). The analysis was therefore narrowed to match insect orders present in 595 both the aquatic and terrestrial community. Three insect orders, Hemiptera, Diptera and 596 Coleoptera, fulfilled this criterion with good representation from each biome. For each case, 597 the ratios of the infected:non-infected species between aquatic and terrestrial communities 598 Alignments and trees are also available from the GigaScience GigaDB repository [96].