-
PDF
- Split View
-
Views
-
Cite
Cite
Emmanuel F A Toussaint, Jérémy Gauthier, Julia Bilat, Conrad P D T Gillett, Harlan M Gough, Håkan Lundkvist, Mickael Blanc, Carlos P Muñoz-Ramírez, Nadir Alvarez, HyRAD-X Exome Capture Museomics Unravels Giant Ground Beetle Evolution, Genome Biology and Evolution, Volume 13, Issue 7, July 2021, evab112, https://doi.org/10.1093/gbe/evab112
- Share Icon Share
Abstract
Advances in phylogenomics contribute toward resolving long-standing evolutionary questions. Notwithstanding, genetic diversity contained within more than a billion biological specimens deposited in natural history museums remains recalcitrant to analysis owing to challenges posed by its intrinsically degraded nature. Yet that tantalizing resource could be critical in overcoming taxon sampling constraints hindering our ability to address major evolutionary questions. We addressed this impediment by developing phyloHyRAD, a new bioinformatic pipeline enabling locus recovery at a broad evolutionary scale from HyRAD-X exome capture of museum specimens of low DNA integrity using a benchtop RAD-derived exome-complexity-reduction probe set developed from high DNA integrity specimens. Our new pipeline can also successfully align raw RNAseq transcriptomic and ultraconserved element reads with the RAD-derived probe catalog. Using this method, we generated a robust timetree for Carabinae beetles, the lack of which had precluded study of macroevolutionary trends pertaining to their biogeography and wing-morphology evolution. We successfully recovered up to 2,945 loci with a mean of 1,788 loci across the exome of specimens of varying age. Coverage was not significantly linked to specimen age, demonstrating the wide exploitability of museum specimens. We also recovered fragmentary mitogenomes compatible with Sanger-sequenced mtDNA. Our phylogenomic timetree revealed a Lower Cretaceous origin for crown group Carabinae, with the extinct Aplothorax Waterhouse, 1841 nested within the genus Calosoma Weber, 1801 demonstrating the junior synonymy of Aplothorax syn. nov., resulting in the new combination Calosoma burchellii (Waterhouse, 1841) comb. nov. This study compellingly illustrates that HyRAD-X and phyloHyRAD efficiently provide genomic-level data sets informative at deep evolutionary scales.
Introduction
Significance
Museum specimens housed in natural history collections represent a tremendous wealth of genetic information that is not yet fully unlocked because their DNA can be highly degraded and difficult to sequence. The recently introduced HyRAD-X method alleviates this impediment by allowing the capture of thousands of loci from the genome of both adequately preserved and old Museum specimens. In this study, we developed a new phyloHyRAD bioinformatics pipeline and used it to produce the first robust, dated phylogenomic backbone for Carabinae giant ground beetles. Our new method makes it possible to perform DNA read alignments at a deep evolutionary scale and to retrieve phylogenomic information from even very old specimens (>50 years), providing a useful tool for future systematic and evolutionary work.
The major drawback when applying genomic techniques to museum specimens is the generally poor preservation inherent in their constituent DNA templates (Colella et al. 2020; Jin et al. 2020), hindering or precluding application of standard high-throughput sequencing methods. Because DNA extracted from these specimens is often degraded and highly fragmented, approaches relying upon genomic complexity reduction (i.e., reducing the portion of the genome that is studied) are only applicable provided that certain initial conditions, such as mild levels of DNA fragmentation (with DNA molecules >5 kb remaining in the extract), are met (Tin et al. 2014; Burrell et al. 2015). Because such methods further fragment DNA molecules having already been fragmented through postmortem endonuclease activity, a minimum amount of starting DNA comprising a significant proportion of long fragments is crucial to allow for the greatest possible number of restriction sites to be conserved across all samples. However, for many old or otherwise deteriorated samples, these conditions will not be met, and consequently, methods such as RADseq (Miller et al. 2007) will often be inapplicable. Alternatively, employing hybridization capture of target DNA using liquid- or solid-phase probes (Carpenter et al. 2013) allows access to molecules of low integrity typical of those present in historical specimens, that is, those that are too short and/or degraded to be retrieved from a sample using conventional methods. Most such methods rely upon commercially synthesized probes, applying either exome capture (Bi et al. 2013) or capture of a subset of loci (Jones and Good 2016). A broad catalog of possible genomic targets can permit exploration of a wide array of evolutionary scales. For instance, UCEs are well-suited for inferring relationships from suprageneric (e.g., Wood et al. 2018) to interspecific levels (e.g., Manthey et al. 2016), whereas capture methods based on RADseq loci can reconstruct phylogeographic scenarios or population processes at the intraspecific level (e.g., Ali et al. 2016; Hoffberg et al. 2016). More seldomly, RADseq approaches have enabled reconstruction of evolutionary patterns at deeper scales (e.g., Leaché et al. 2015).
Among the latter techniques is Hybridization Capture Using RAD Probes (HyRAD, Suchan et al. 2016), which in contrast to all other hybridization capture methods based on RADseq loci that require prior knowledge of genome sequences, does not require commercially synthesized probes, instead relying only on probes that can be synthesized within researchers’ laboratories. For this method, the probes are built through a double-digest RADseq (Peterson et al. 2012) protocol using either a high-integrity DNA template from fresh specimens of the target species (i.e., HyRAD sensu stricto) or a messenger RNA template (i.e., HyRAD-X, Schmid et al. 2017). So far, both methods have proven reliable in uncovering evolutionary processes at the intraspecific level using both historical (Schmid et al. 2018; Gauthier et al. 2020) and ancient samples (Schmid et al. 2017). However, their application at broader evolutionary scales in a phylogenetic framework has not yet been evaluated. Because HyRAD-X uses a messenger RNA template to design the double-digestion RAD-sequencing (ddRAD)-based probes onto which DNA of the specimens of interest will hybridize, it should theoretically perform similarly to an exome capture approach in producing data suitable for phylogenetic reconstruction. A major advantage being that prior acquisition of commercially synthesized probes representative of the target lineage is unnecessary. Instead, production of a set of probes from a limited number of fresh specimens spanning the expected phylogenetic diversity of the target clade should sufficiently encompass the extant diversity of that clade to enable capture of all fragments homologous to the reduced representation in the messenger RNA template. Bioinformatic treatment of HyRAD data has been made more accessible thanks to the release of popHyRAD, a suite of scripts dedicated to treating HyRAD data at a shallow evolutionary scale (Gauthier et al. 2020). However, to analyze HyRAD-X data at deeper evolutionary scales, several modifications should be implemented, such as alignment of genomic data to an exomic template, the merging of reads aligned to different probes into a single alignment encompassing all taxa, and the output of final alignments as well as single nucleotide polymorphism (SNP) data.
The cosmopolitan ground beetle subfamily Carabinae (Insecta, Coleoptera, Carabidae) is a useful model with which to apply and test the versatility of HyRAD-X. Macroevolutionary trends in this group, for example pertaining to the evolution of wing morphologies (i.e., apterous, brachypterous, macropterous, see Imura et al. 2018; Toussaint and Gillett 2018), are currently obscured by the lack of a robust phylogenetic framework and divergence times. This group comprises ca. 1500 species distributed in four tribes: the Holarctic Cychrini (four genera, ca. 230 species), the South American Ceroglossini (one genus, ca. 10 species), the Australasian Pamborini (two genera, ca. 20 species), and the cosmopolitan Carabini (two genera, ca. 1100 species) (Jiroux 2006; Takami and Sota 2006; Cavazzuti 2010; Bruschi 2013; Deuve 2019). Furthermore, the monotypic genus Aplothorax Waterhouse, 1841 endemic to the Atlantic island of St Helena, considered extinct since the 1970s (Gray et al. 2019), is classified within the subfamily but without a robust tribal placement (Jeannel 1940; Prüser and Mossakowski 1998; Bruschi 2013; but see Sota et al. 2020). In addition to ranking among the most popular of beetles owing to their often elegant and attractive appearance (Fabre 1901; Chatenet 1986), the Carabinae include some of the largest and bulkiest ground beetles, including species in the subgenus Carabus (Procerus) Dejean, 1826 that attain 6 cm in length (Cavazzuti 1989), and can be termed ‘giant ground beetles’ accordingly. This subfamily was the focus of early studies during the dawn of molecular phylogenetics (Su et al. 1996a, 1996b); all clades having since received attention from phylogenetic studies incorporating Sanger sequencing (Kim, Tominaga, et al. 2000; Kim, Zhou, et al. 2000; Takami 2000; Tominaga et al. 2000; Okamoto et al. 2001; Sota and Vogler 2001, 2003; Su et al. 2003, 2005; Osawa et al. 2004; Sota and Ishikawa 2004; Su, Imura, Okamoto, Kim, et al. 2004; Su, Imura, Okamoto, and Osawa 2004; Sota et al. 2005; Andújar et al. 2012a; Andújar et al. 2012b; Deuve et al. 2012; Deuve and Faille 2013; Muñoz-Ramírez 2015; Muñoz-Ramírez et al. 2016; Imura et al. 2018; Toussaint and Gillett 2018; Sota et al. 2020). An early attempt to employ RAD-sequencing within the genus Carabus L., 1758 aimed to construct a molecular phylogeny of the charismatic subgenus Carabus (Chysocarabus) Thomson, 1875, indicating that a potential strength of such an approach is the inference of relationships at the intrageneric level (Cruaud et al. 2014). This approach was successfully pursued for phylogenetic inference within Japanese Carabus (e.g., Takahashi et al. 2014; Fujisawa et al. 2019). However, very few studies have focused on intertribal relationships (but see Osawa et al. 2004; Toussaint and Gillett 2018; Sota et al. 2020). Based upon preliminary analyses of Sanger sequencing or mitogenome data, the snail-eating beetles (Cychrini) have been shown to be sister to the rest of the subfamily, followed by Ceroglossini and Pamborini, which together form an apparent Gondwanan stock, that is in turn sister to Carabini (comprising the diverse genera Carabus and Calosoma Weber 1801) (Imura et al. 2018; Toussaint and Gillett 2018; Sota et al. 2020). Jeannel (1940) hypothesized that Aplothorax is a close relative of the Afrotropical subgenus Calosoma (Ctenosta) Motschulsky, 1865 and this placement was recently supported by analysis of mitogenomes (Sota et al. 2020). The genus Calosoma is unique in this subfamily of predominantly brachypterous flightless species, for containing a significant proportion of fully winged and flight-capable species (Bruschi 2013). Charles Darwin himself observed the impressive flight capabilities of Calosoma when he saw them flying onto the H.M.S. Beagle whilst many miles off the South American coast (Darwin 1845). Major obstacles to obtaining a robust and comprehensive phylogeny of Carabinae include the substantial species richness of the group, the widespread geographic distribution of constituent clades across diverse latitudes and elevations, combined with the extreme localization of many species.
Although reliable fossils of Carabinae exist (e.g., Nel 1987, 1988; Deuve 1998; Farrés and Altimiras 2012; Yahiro et al. 2018; Kirejtshuk et al. 2019), the age of the subfamily has been contentious, since the early work of Jeannel (1940) to recent times (Andújar et al. 2012a; Andújar et al. 2014; Toussaint and Gillett 2018; Opgenoorth et al. 2020; Sota et al. 2020). This uncertainty stems from the disputed sister group to Carabinae, an absence of reliable fossils outside of the tribe Carabini, and the frequent past use of biogeographic constraints in clock calibrations, which have likely biased results toward younger divergence times. Current estimates for the crown age of Carabinae range between as recent as ca. 35 Ma (Andújar et al. 2012a; Andújar et al. 2014) to as old as ca. 170 Ma (Toussaint and Gillett 2018). Larger-scale phylogenomic studies of beetles have also provided insights into the evolution of Carabinae. For instance, the most comprehensive attempt at unraveling the evolutionary history of beetles to date (McKenna et al. 2019) recovered an age of ca. 35 Ma for the crown of Calosoma and Carabus, and an age of ca. 120 Ma for the stem (i.e., the split with Elaphrus [Carabidae, Elaphrinae] in that study). However, the placement of Carabinae within ground beetles remains controversial in light of recent phylogenomic work (Gustafson et al. 2020) and our understanding of Carabinae evolutionary history is therefore hampered by the lack of a robust phylogeny and requires an in-depth phylogenomic reappraisal. Considering that giant ground beetles were some of the first nonmodel insects to have been studied using molecular phylogenetic approaches, now seems a timely opportunity to develop new methods employing a modern toolkit, to further unravel the evolutionary history of these charismatic beetles.
We apply HyRAD-X to a representative set of giant ground beetles and infer divergence times at an unprecedented evolutionary scale compared with previous museomics studies based on a set of RAD-like loci. We analyze RAD-like loci obtained from 48 specimens selected to represent all recognized genera of Carabinae and recently inferred clades within the genus Calosoma (Toussaint and Gillett 2018). We also introduce a novel, dedicated bioinformatic pipeline, phyloHyRAD, which complements the recently published popHyRAD suite (Gauthier et al. 2020), allowing for construction of a reference catalog, the alignment of reads, and the identification of individual sequences in a phylogenetic framework. Our aims are 1) to generate a new exome capture toolkit to infer the evolutionary history of Carabinae at different taxonomic scales, from intersubfamilial to intrageneric relationships; 2) to introduce a new bioinformatic pipeline allowing the optimization of data recovery from HyRAD-X raw reads in a phylogenetic context (fig. 1); 3) to infer a robust phylogenomic backbone for Carabinae using the resulting HyRAD-X output combined with external data produced through RNAseq and UCE approaches; and 4) to estimate divergence times among giant ground beetles.
Results and Discussion
Genomic Data Recovery
After DNA extraction of 48 specimens, we applied the HyRAD-X protocol, and two independent Illumina sequencing runs to obtain a mean of 15,737,206 reads for each sample (SD 19,657,890). After cleaning of these raw reads (see below), 96.4% (SD 5.2%) of reads were retained (supplementary table S1, Supplementary Material online). These results highlight considerable heterogeneity in the amount of genetic information recovered across samples. From the initial sampling, 13 specimens produced a very low number of reads and recovered loci, with a correspondingly low genomic coverage (usually <5 kb in total) (fig. 2) preventing their inclusion in the final analyses, although in preliminary testing they were recovered in their correct tribes (e.g., Cychrus morawitzi Gehin, 1863, Calosoma atrovirens Chaudoir, 1869, Calosoma burtoni Alluaud, 1913, and Calosoma chinense Kirby 1818). Overall, 35 out of 48 sequenced taxa were included in the final analyses, covering all currently recognized genera except for the Nearctic genus Scaphinotus Dejean, 1826, whose phylogenetic placement within Cychrini remains doubtful (Su et al. 2004; Imura et al. 2018; Toussaint and Gillett 2018; Sota et al. 2020).
For the mapping step, although BWA-ALN is the recommended tool when analyzing ancient DNA (Schubert et al. 2012), we evaluated the influence of the mapping tools by also using a second, more flexible approach, namely BWA-MEM (Li 2013). Indeed, at deep evolutionary scales, where the genetic distance between the probes and the captured samples can sometimes be significant, and which is a crucial parameter for both the capture process and the mapping onto reference probes, a higher flexibility in mapping seems to be essential to retrieve sufficient genetic information. We observed a higher percentage of mapped reads using BWA-MEM (34.9% [SD 12.9%] before cleaning and 4.6% [SD 2.4%] after cleaning) than when using BWA-ALN (2.6% [SD 2.8%]) of mapped reads before cleaning and 0.2% [SD 0.1%]) after cleaning) (details in supplementary table S1, Supplementary Material online). Differences in percentages of mapped reads affect the mean coverage (supplementary fig. S1, Supplementary Material online) and although lower coverage still allows recovery of some loci, these are characterized by increased missing data. The mean number of retrieved loci obtained by each mapping method is similar, that is, 1,398 loci with BWA-MEM (SD 816) (supplementary fig. S1, Supplementary Material online) and 1,178 loci with BWA-ALN (SD 808) (supplementary table S1, Supplementary Material online), although the proportion of missing data is higher with BWA-ALN (table 1). This is particularly reflected in the number of Parsimony-informative sites (i.e., those with a minor-frequency allele shared by at least two specimens) present in the final alignments, where five times more informative sites were present in the BWA-MEM derived alignment (57,285 sites) than in that obtained with BWA-ALN (9,922 sites) (table 1). This variation explains differences in resolution and branch support in the phylogeny across analyses (fig. 3).
Alignment Statistics for Each Data Set, Including the Alignment Length, the Percentage of Missing Data, the Numbers, and Percentages of Variable Sites and of Parsimony Informative Sites, and the GC Content
Data Set . | Mapping Tool . | No. of Taxa . | Loci . | Alignment Length . | Missing Data . | Variable Sites . | Parsimony Informative Sites . | GC Content . |
---|---|---|---|---|---|---|---|---|
A | BWA-MEM | 43 | 2,945 | 474,696 | 67.9% | 102,929 (21.7 %) | 57,387 (12.1 %) | 0.456 |
B | BWA-MEM | 40 | 2,945 | 474,696 | 65.7% | 102,802 (21.7 %) | 57,285 (12.1 %) | 0.455 |
C | BWA-ALN | 40 | 1,227 | 222,732 | 80.7% | 24,772 (11.1 %) | 9,922 (4.5 %) | 0.465 |
D | BWA-MEM | 43 | 100 | 27,749 | 56.5% | 6,946 (25.0%) | 4,177 (15.1%) | 0.448 |
Data Set . | Mapping Tool . | No. of Taxa . | Loci . | Alignment Length . | Missing Data . | Variable Sites . | Parsimony Informative Sites . | GC Content . |
---|---|---|---|---|---|---|---|---|
A | BWA-MEM | 43 | 2,945 | 474,696 | 67.9% | 102,929 (21.7 %) | 57,387 (12.1 %) | 0.456 |
B | BWA-MEM | 40 | 2,945 | 474,696 | 65.7% | 102,802 (21.7 %) | 57,285 (12.1 %) | 0.455 |
C | BWA-ALN | 40 | 1,227 | 222,732 | 80.7% | 24,772 (11.1 %) | 9,922 (4.5 %) | 0.465 |
D | BWA-MEM | 43 | 100 | 27,749 | 56.5% | 6,946 (25.0%) | 4,177 (15.1%) | 0.448 |
Alignment Statistics for Each Data Set, Including the Alignment Length, the Percentage of Missing Data, the Numbers, and Percentages of Variable Sites and of Parsimony Informative Sites, and the GC Content
Data Set . | Mapping Tool . | No. of Taxa . | Loci . | Alignment Length . | Missing Data . | Variable Sites . | Parsimony Informative Sites . | GC Content . |
---|---|---|---|---|---|---|---|---|
A | BWA-MEM | 43 | 2,945 | 474,696 | 67.9% | 102,929 (21.7 %) | 57,387 (12.1 %) | 0.456 |
B | BWA-MEM | 40 | 2,945 | 474,696 | 65.7% | 102,802 (21.7 %) | 57,285 (12.1 %) | 0.455 |
C | BWA-ALN | 40 | 1,227 | 222,732 | 80.7% | 24,772 (11.1 %) | 9,922 (4.5 %) | 0.465 |
D | BWA-MEM | 43 | 100 | 27,749 | 56.5% | 6,946 (25.0%) | 4,177 (15.1%) | 0.448 |
Data Set . | Mapping Tool . | No. of Taxa . | Loci . | Alignment Length . | Missing Data . | Variable Sites . | Parsimony Informative Sites . | GC Content . |
---|---|---|---|---|---|---|---|---|
A | BWA-MEM | 43 | 2,945 | 474,696 | 67.9% | 102,929 (21.7 %) | 57,387 (12.1 %) | 0.456 |
B | BWA-MEM | 40 | 2,945 | 474,696 | 65.7% | 102,802 (21.7 %) | 57,285 (12.1 %) | 0.455 |
C | BWA-ALN | 40 | 1,227 | 222,732 | 80.7% | 24,772 (11.1 %) | 9,922 (4.5 %) | 0.465 |
D | BWA-MEM | 43 | 100 | 27,749 | 56.5% | 6,946 (25.0%) | 4,177 (15.1%) | 0.448 |
Data from the eight taxa obtained from external repositories were included in the final alignments. Mapping of these data revealed a low percentage of mapped reads: a mean of 6.8% (SD 3.2%) before cleaning and 1.3% (SD 0.9%) after cleaning with BWA-MEM, and a mean of 0.4% (SD 0.4%) before cleaning and 0.1% (SD 0.1%) after cleaning with BWA-ALN (supplementary table S1, Supplementary Material online). This was expected because these data resulted from approaches such as RNA-Seq or UCEs that do not specifically target the loci captured by our HyRAD-X approach. Despite these relatively low mapping levels, locus recovery from the raw reads for these taxa remained excellent, with a mean recovery of 1,788 loci (SD 868) with BWA-MEM and 530 loci (SD 515) with BWA-ALN.
The effect of missing data on phylogeny reconstruction has been intensively studied for nearly three decades (Novacek, 1992; Wiens, 1998; Dell’Ampio et al. 2014; Sann et al. 2018), including studies restricted to the application of RADseq-type approaches (Rubin et al. 2012; Hosner et al. 2016; Huang and Knowles 2016; Crotti et al. 2019). Most studies agree that data sets containing a certain level of missing data, even of the order of 50% or 60%, can still provide useful phylogenetic information. On the other hand, overly permissive strategies, that include loci consisting of more than 80% or 90% missing data, have revealed suboptimal results (Crotti et al. 2019). In addition, the inclusion of sequences that are too short may also prove unpredictable, due to gene trees that are literally too short to allow optimization of the species tree (Hosner et al. 2016). However, Streicher et al. (2016) have shown that in concatenation approaches, optimality can be achieved despite a relatively high level of missing data, provided that the taxon sampling is high.
Our most well-supported result was achieved using the most flexible aligner (BWA-MEM), which yielded both the highest number of retrieved loci and the lowest level of missing data.
Phylogenetic Inference of Giant Ground Beetles
Results of the maximum likelihood phylogenomic inferences are summarized in figure 3 (see supplementary figs. S2–S6, Supplementary Material online for more details). The phylogenetic resolution and associated branch support across the topologies differed greatly according to the type of loci (BWA-ALN vs. BWA-MEM) and the taxon sampling. The reduced number of loci obtained using BWA-ALN (Data set C) resulted in lower branch support and inconsistent phylogenetic relationships compared with other analyses (fig. 3). Similarly, including taxa represented by low coverage resulted in decreased branch support (fig. 3, Data set A). Removing the outgroup Tetracha carolina (Linnaeus, 1763), represented by low loci coverage (206 loci with BWA-MEM, fig. 3), resulted in strong branch support across the backbone of the tree (fig. 3, Data set B). However, RogueNaRok analyses did not reveal a high number of rogue taxa nor did they identify T. carolina as such. Only Calosoma deckeni (Gerstaeker, 1867) in Data set A and Carabus nemoralis Müller, 1764 in Data set E were identified as potential rogue taxa (see supplementary table S3, Supplementary Material online for detailed results of the RogueNaRok analyses). Finally, the phylogenetic hypothesis derived from the analysis of Data set D (fig. 3) is largely compatible with the one derived from the most comprehensive analysis of Data set A despite being based on a substantially lower number of loci. Below we summarize and discuss the results of the analysis based on Data set B, which we consider the most robust inference.
We recover Carabinae as monophyletic and divided into three main clades, in line with previous phylogenetic reconstructions (Imura et al. 2018; Toussaint and Gillett 2018; Sota et al. 2020). The first clade corresponds to the tribe Cychrini, wherein Sphaeroderus Dejean, 1831 is recovered as sister to Cychrus Fabricius, 1794 and Cychropsis Boileau, 1901, with strong branch support. The missing genus Scaphinotus could not be included in our final analyses and its placement within Cychrini remains uncertain (Su et al. 2004; Toussaint and Gillett 2018). The second clade comprises the tribes Ceroglossini and Pamborini that are recovered as sister clades with strong branch support. Within Pamborini, we recover Maoripamborus Brookes, 1944 as sister to Pamborus Latreille, 1817, with strong branch support. The third clade includes the tribe Carabini, with the subtribes Carabina and Calosomatina recovered as reciprocally monophyletic, with strong branch support.
Within Carabus, we recover three main clades. The first one comprises the subgenera Carabus (Hygrocarabus) Thomson, 1875 and Carabus (Platycarabus) Morawitz, 1886, as proposed by Deuve et al. (2012), who also recovered Carabus (Chaetocarabus) Thomson, 1875 (not included in this study) within it (i.e., the Arcifera group sensu Deuve 2019). We then recover a clade comprising Carabus (Orinocarabus) Kraatz, 1878 and Carabus (Tanaocarabus) Reitter, 1896 as sister to the remainder of Carabus. Although our taxon sampling is limited in this highly diverse genus, which contains almost 100 subgenera (Deuve 2019), the strongly supported backbone we infer will serve as a basis for future phylogenetic work.
Our analyses indicate with strong branch support, and in concordance with Sota et al. (2020), that Aplothorax is a lineage nested within the cosmopolitan genus Calosoma. The former genus therefore becomes a junior subjective synonym of the latter, Aplothorax Waterhouse, 1841 syn. nov. of Calosoma Weber, 1801, thereby creating the new combination: Calosoma burchellii (Waterhouse, 1841) comb. nov. This placement confirms Jeannel’s (1940) early hypothesis based on morphological features. Further studies are needed to understand the placement of this unique lienage with respect to the winged species of the subgenus Calosoma (Ctenosta). Within Calosoma, our phylogenomic inference recovers the same clades revealed in Toussaint and Gillett (2018) despite a reduced taxon sampling, but with strong branch support across the backbone and the main clades. However, our results contradict, to some extent, the results of Sota et al. (2020) who studied mitogenomes across Carabinae and recovered a different hypothesis within Calosoma, albeit with lower branch support. Our results provide a robust backbone for this genus, intimating that taxonomic reappraisal is necessary. The phylogenomic tree presented in this study confirms multiple gains and losses of wings across Calosoma, which likely resulted from its complex and dynamic biogeographical history (fig. 4).
Evolutionary History of Carabinae Beetles
Our BEAST analyses all converged successfully, with effective sample size values well above 200 for all estimated parameters. Overall, divergence times broadly overlapped between analyses using different clock and tree priors (table 2). Comparison of marginal likelihoods reveals that the analysis based upon seven Bayesian uncorrelated lognormal relaxed clocks and a Yule tree model was best supported, although support was not significantly better than with the same number of clocks and a birth–death model (table 2). We present the results of this analysis in figure 4 and discuss them hereafter (see supplementary fig. S7, Supplementary Material online for more details). We recover a split between tiger beetles (Cicindelidae) and ground beetles in the Lower Cretaceous ca. 140 Ma. The crown age of Carabinae is estimated to lie in the Lower Cretaceous, ca. 110 Ma (95% HPD = 92–130 Ma). Following that split, we recover a crown age for Cychrini of ca. 55 Ma (95% HPD = 42–72 Ma). We infer a split between Carabini and the Ceroglossini+Pamborini clade ca. 100 Ma (95% HPD = 58–85 Ma), at the Lower to Upper Cretaceous boundary. The split between Ceroglossini and Pamborini is estimated to have occurred ca. 80 Ma (95% HPD = 67–99 Ma), suggesting a possible vicariant event between the faunas of South America and Australia, which could be linked to the glaciation of Antarctica. The split between Maoripamborus and Pamborus is dated at ca. 70 Ma (95% HPD = 52–83 Ma), which is consistent with the hypothesis of vicariance between Australia and Zealandia (Schellart et al. 2006; Neall and Trewick 2008). The split between Calosoma and Carabus is also dated at ca. 70 Ma (95% HPD = 58–85 Ma), whilst the crown ages of Carabus and Calosoma are, respectively, dated at ca. 50 Ma (95% HPD = 40–59 Ma) and ca. 40 Ma (95% HPD = 33–48 Ma). Without denser sampling of the few Nearctic Carabus species, it is difficult to discuss the alternative biogeographic hypotheses that led to the colonization of this region from the Palearctic. Within Calosoma, most splits cannot be explained by vicariant events, and therefore the biogeographical history of the genus must have been very dynamic, with multiple range expansion and dispersal events at continental scales. The colonization of St Helena by the ancestors of Aplothorax took place between the Oligocene and Miocene, most likely out of Africa, where the closely-related and fully winged subgenus Calosoma (Ctenosta) occurs. The oldest rocks on that remote oceanic island of volcanic origin have been dated at ca. 14.5 Ma (Baker 1973), which is in line with our estimates. Considering the remoteness of St Helena from continental Africa, we hypothesize that ancestors of Aplothorax colonized that island by active flight (potentially wind-assisted). This possibly occurred in a stepping-stone manner across seamounts of the St Helena/Guinea chain, which may previously have been emerged (Peyve 2011), the beetles subsequently losing their flight capacity through adaptation to the island ecosystems by evolution of progressive brachyptery, as is well documented in other insular insects (e.g., Roff 1990).
BEAST Median Divergence Times and 95% Credibility Intervals with MLE Comparison
Analysis . | Clocks . | Tree Model . | SS MLE . | PS MLE . | Root Median [95% CI] . | Carabidae Median [95% CI] . | Carabini Median [95% CI] . |
---|---|---|---|---|---|---|---|
A1 | 1 | Yule | −110,761.73 | −110,764.85 | 147.0 [129.0–167.3] | 126.1 [105.8–145.7] | 71.0 [53.8–88.5] |
A2 | 1 | Birth–death | −110,759.26 | −110,765.45 | 147.0 [127.9–166.1] | 125.6 [105.8–145.7] | 69.9 [53.1–87.3] |
A3 | 7 | Yule | −110,706.32 | −110,706.93 | 141.7 [125.4–160.8] | 127.3 [109.1–145.9] | 71.4 [58.2–84.5] |
A4 | 7 | Birth–death | −110,706.19 | −110,706.75 | 142.0 [124.4–159.8] | 127.4 [108.7–145.9] | 71.4 [58.6–85.0] |
Analysis . | Clocks . | Tree Model . | SS MLE . | PS MLE . | Root Median [95% CI] . | Carabidae Median [95% CI] . | Carabini Median [95% CI] . |
---|---|---|---|---|---|---|---|
A1 | 1 | Yule | −110,761.73 | −110,764.85 | 147.0 [129.0–167.3] | 126.1 [105.8–145.7] | 71.0 [53.8–88.5] |
A2 | 1 | Birth–death | −110,759.26 | −110,765.45 | 147.0 [127.9–166.1] | 125.6 [105.8–145.7] | 69.9 [53.1–87.3] |
A3 | 7 | Yule | −110,706.32 | −110,706.93 | 141.7 [125.4–160.8] | 127.3 [109.1–145.9] | 71.4 [58.2–84.5] |
A4 | 7 | Birth–death | −110,706.19 | −110,706.75 | 142.0 [124.4–159.8] | 127.4 [108.7–145.9] | 71.4 [58.6–85.0] |
NOTE.—SS, stepping-stone sampling; MLE, marginal likelihood estimate; PS, path-sampling; 95% CI, 95% credibility interval from posterior BEAST estimates.
BEAST Median Divergence Times and 95% Credibility Intervals with MLE Comparison
Analysis . | Clocks . | Tree Model . | SS MLE . | PS MLE . | Root Median [95% CI] . | Carabidae Median [95% CI] . | Carabini Median [95% CI] . |
---|---|---|---|---|---|---|---|
A1 | 1 | Yule | −110,761.73 | −110,764.85 | 147.0 [129.0–167.3] | 126.1 [105.8–145.7] | 71.0 [53.8–88.5] |
A2 | 1 | Birth–death | −110,759.26 | −110,765.45 | 147.0 [127.9–166.1] | 125.6 [105.8–145.7] | 69.9 [53.1–87.3] |
A3 | 7 | Yule | −110,706.32 | −110,706.93 | 141.7 [125.4–160.8] | 127.3 [109.1–145.9] | 71.4 [58.2–84.5] |
A4 | 7 | Birth–death | −110,706.19 | −110,706.75 | 142.0 [124.4–159.8] | 127.4 [108.7–145.9] | 71.4 [58.6–85.0] |
Analysis . | Clocks . | Tree Model . | SS MLE . | PS MLE . | Root Median [95% CI] . | Carabidae Median [95% CI] . | Carabini Median [95% CI] . |
---|---|---|---|---|---|---|---|
A1 | 1 | Yule | −110,761.73 | −110,764.85 | 147.0 [129.0–167.3] | 126.1 [105.8–145.7] | 71.0 [53.8–88.5] |
A2 | 1 | Birth–death | −110,759.26 | −110,765.45 | 147.0 [127.9–166.1] | 125.6 [105.8–145.7] | 69.9 [53.1–87.3] |
A3 | 7 | Yule | −110,706.32 | −110,706.93 | 141.7 [125.4–160.8] | 127.3 [109.1–145.9] | 71.4 [58.2–84.5] |
A4 | 7 | Birth–death | −110,706.19 | −110,706.75 | 142.0 [124.4–159.8] | 127.4 [108.7–145.9] | 71.4 [58.6–85.0] |
NOTE.—SS, stepping-stone sampling; MLE, marginal likelihood estimate; PS, path-sampling; 95% CI, 95% credibility interval from posterior BEAST estimates.
Our results predate the divergence times proposed by Andújar et al. (2014), who recovered a crown age for Carabinae ca. 30 Ma, compared with ca. 110 Ma in this study. Their approach based on Bayes Factor Cluster Analysis led them to select comparatively recent biogeographic and fossil calibrations, whilst simultaneously excluding older ones (i.e., the vicariance between Australia and New Zealand), resulting in very recent age estimates. Relying on a chimeric approach, Sota et al. (2020) constrained the crown of Carabini with a secondary calibration derived from that of Andújar et al. (2014) but also included the vicariance between Australia and New Zealand as a minimum age for the split between Maoripamborus and Pamborus, resulting in a likely unrealistic stem age for Carabini (ca. 120 Myr). Our results present a more balanced pattern, with a crown age for Carabinae broadly in line with that of Sota et al. (2020) despite our reliance on fossil and secondary calibrations. Correspondingly, our divergence time estimates for the crown of Carabus significantly predate those of Andújar et al. (2014) and Sota et al. (2020), who proposed a range of ca. 25–35 Ma. Similarly, our results postdate the estimates of Toussaint and Gillett (2018) who relied on a single deep secondary calibration between Trachypachus Motschulsky, 1844 and Carabinae, resulting in older divergence times. However, our estimates are consistent with recent reappraisal of ages in this group based upon new biogeographic and fossil calibrations (Opgenoorth et al. 2020). We believe that our Bayesian relaxed clock dating exercise, based on the most comprehensive review of the fossil record to date and incorporating recent secondary calibrations obtained from large-scale phylogenomic studies, is more likely to represent a good estimate of Carabinae evolution. Future inferences of comprehensively sampled and robust phylogenomic Adephaga timetrees based on multiple fossil calibrations will eventually unveil the sister group to Carabinae with more certainty.
Efficiency and Future of HyRAD-X-Like Approaches
The HyRAD-X approach has proved efficient in generating a comprehensive genomic data set leading to inference of a new phylogenomic hypothesis for Carabinae (fig. 3). The success of HyRAD-X appears to be a product of its flexibility. For instance, during the hybridization step, it is possible for historical DNA target sequences with some mismatch to nevertheless hybridize to the template probes. Moreover, during the phyloHyRAD bioinformatic mapping step, the use of the more relaxed BWA-MEM mapping algorithm instead of the more stringent BWA-ALN allowed for the recovery of much more data than anticipated (Ziemann 2016). Our strategy of distributing the probe-producing samples across the expected diversity in the phylogeny has made it possible to obtain homologous sequences for species that are phylogenetically distant. One exception was the outgroup T. carolina, for which comparatively fewer loci were recovered (258 with BWA-MEM and 197 with BWA-ALN), resulting in reduced phylogenetic resolution (fig. 3).
To investigate factors involved in the recovery of sequences and loci from historical samples, we estimated correlations between the number of sequenced reads or mapped loci and sample specificities. Firstly, we evaluated the influence of the genetic distance to a probe by estimating the phylogenetic distance between each sample and the phylogenetically closest sample used as a probe. A significant correlation was observed between this distance and the number of mapped loci obtained using BWA-ALN. A similar tendency was also observed for the number of loci obtained with BWA-MEM from the fresh samples (in black on fig. 5A) but no correlation was observed for historical samples (in yellow on fig. 5A). These results reveal a limited influence of probe diversity on their ability to hybridize to and recover sequences from fresh samples for which we had obtained DNA of high integrity. This result is in line with results from other approaches such as UCEs where locus recovery decreases with phylogenetic distance from taxa used in probe design (e.g., Faircloth et al. 2015; Gustafson et al. 2020). In contrast, for historical samples containing comparatively fragmented DNA, loci recovery success is likely dependent on other factors, such as the age of the sample or the storage conditions (e.g., when preserved in constant climatic conditions, DNA can still be better preserved), which may induce heterogeneity between samples. In addition, the fact that historical samples are characterized by overall smaller fragments than fresh samples might increase the likelihood of hybridization during the capture phase, even to probes that are phylogenetically distant (i.e., reduction of molecular stoichiometry during hybridization). Similarly, for these historical samples, no correlation between the age of the sample and the amount of retrieved genetic data, reads, and loci was observed (in yellow on fig. 5B). This result is surprising because such a correlation is usually recovered when applying capture approaches to museum specimens (e.g., Blaimer et al. 2016; McCormack et al. 2016). Nevertheless, a significant difference exists in the amount of retrieved genetic data, reads, and loci, between fresh and historical samples, with, as expected, a higher observed recovery success for fresh samples (fig. 5C).
The development of the new phyloHyRAD pipeline (fig. 1) is a major step forward for rapid and accurate locus assembly. This pipeline allows for the construction of a catalog from probe data, the mapping of reads from historical samples, data cleaning, and the construction of consensus sequences for loci that are subsequently combined and aligned to obtain the final alignments. Moreover, this pipeline is particularly flexible because it makes possible the integration of external data from extremely diverse capture methods, including RNA-Seq and UCEs. In this study, it resulted in the acquisition of sufficient loci for successful analysis, with a mean of 1,788 loci (SD 759) obtained with BWA-MEM and 530 loci (SD 421) with BWA-ALN from our samples (table 1). These features ensure that, in combination with the HyRAD-X approach, the pipeline is broadly applicable, allowing for the combination and integration of existing legacy data to further enrich data sets.

Schematic representation of the phyloHyRAD pipeline. In green is shown the construction of the probe loci used as a catalog for the mapping of the sample reads (shown in brown). The purple section indicates the mapping and cleaning steps that lead to construction of the combined and aligned loci for each sample, which are then concatenated and used in the phylogenetic analyses.

Graphical representation of locus recovery for each taxon. The histogram represents the number of final assembled HyRAD-X loci obtained with BWA-MEM for each taxon, including those that were not retained for final phylogenomic inferences. Taxa are ordered taxonomically with each major clade being illustrated by a habitus photograph of a representative species.
Deep sequencing of the libraries resulted in the sequencing of DNA fragments present in large quantities accompanying the captured and amplified fragments in the sample libraries. The HyRAD-X approach allowed for recovery of these high copy number sequences, including mitochondrial and ribosomal DNA, resulting in sequences for 13 mitochondrial genes from our historical samples. When combined with existing mitogenomes available for Carabinae (Sota et al. 2020) and analyzed under a maximum likelihood criterion, these data resulted in the inference of a large mitogenome phylogeny (supplementary figs. S8 and S9, Supplementary Material online), further illustrating the potential for HyRAD-X to enhance data accessibility and analysis possibilities from combining existing next generation or classical barcode sequence data. This tree recovers moderate branch support as well as inconsistent phylogenetic relationships with the analyses based on genome-scale data (fig. 3). This result is not surprising considering the high evolutionary rate of mitochondrial DNA and the potentially corresponding high levels of saturation and homoplasy in older lineages (Rubinoff and Holland 2005; DeSalle et al. 2017). Nevertheless, this phylogeny is likely to represent the most comprehensive one of Carabinae to date with relevant intrageneric phylogenetic placements (supplementary figs. S8 and S9, Supplementary Material online). However, we emphasize that the phylogenomic hypothesis presented in figure 3 is the most robust global subfamily-level estimate to date for the evolutionary history of Carabinae.

Best Maximum likelihood tree inferring Carabinae relationships. Best scoring maximum likelihood tree based on Data set A (2,945 loci, 43 taxa, BWA-MEM mapper). Branch support from this analysis is shown for all branches. Branch support retrieved in different analyses is shown for major branches according to the inserted caption. A live photograph of the Mongolian desert endemic Calosoma (Callisthenes) fischeri Fischer, 1842 is presented (Photograph credit: Lily Kumpe).

Bayesian divergence time estimates for Carabinae. Maximum clade credibility tree obtained from a BEAST analysis using seven Bayesian lognormal relaxed clocks and a Yule pure birth tree model. Node estimates are postburn in median ages, with 95% credibility intervals represented by a grey horizontal bar for each node. An illustration of Aplothorax burchellii scavenging on an endemic St. Helena Darter Sympetrum dilatatum (Calvert, 1892) is presented.

Statistical summary of locus recovery. (A) Plot representing the relationship between the minimal phylogenetic distance to a probe and three outcomes: the relative sequencing depth, the number of BWA-MEM mapped loci, and the number of BWA-ALN mapped loci. In each plot, fresh samples are shown in black, fresh samples also used for probes in green, and samples from museums with an age > 30 years in orange. Correlations were tested with Spearman’s correlation tests on fresh samples in black, museum samples in orange, and combined fresh and museum samples in gray. (B) Plots representing the relation between the age of the sample and the three previous outcomes. (C) Boxplots comparing the three previous outcomes between the museum samples and the fresh samples (including fresh samples also used for probes). Significance was evaluated using Wilcoxon rank tests. For the fresh samples additional boxplots show variation according to the sample storage history: 1) in absolute ethanol at room temperature, 2) in absolute ethanol at −20 °C, 3) dry at −80 °C, and 4) in RNAlater at −80 °C.
Conclusion
The HyRAD-X approach is a powerful and versatile addition to the phylogenomic toolbox, allowing for the generation of large data sets that are compatible with barcoding, mitogenomic, target capture, or RNA-seq data. Introduction of the new phyloHyRAD bioinformatic pipeline ensures that this approach is even more tractable, efficient, and accessible. Using this combined methodology, we infer a robust dated phylogenomic hypothesis for Carabinae giant ground beetles derived largely from museum collection specimens, thereby demonstrating that genomic-level DNA information preserved in historical specimens can be unlocked and widely exploited in evolutionary studies.
Materials and Methods
Taxon Sampling
We sampled 48 taxa for the purpose of this study, including two outgroups based on recent ground beetle phylogenomic evidence (Gough et al. 2020), namely Scarites buparius (Forster, 1771) (Coleoptera, Carabidae, Scaritinae) and the tiger beetle T. carolina (Coleoptera, Cicindelidae). The latter was used to root the trees. Fieldwork was conducted in Chile, Corsica, continental France, Sweden, and the United States to gather fresh specimens, which were collected either into 96% ethanol or RNAlater (Thermo Fisher Scientific) for probe design. In total six specimens were used to design the probe set: Calosoma (Castrida) sayi Dejean, 1826, Calosoma (Calosoma) sycophanta (L., 1758), Carabus (Platycarabus) irregularis Fabricius, 1792, Carabus (Hygrocarabus) variolosus Fabricius, 1787, Ceroglossus buqueti (Laporte, 1834), and S. buparius. The rest of the sampled specimens were either dry-pinned or ethanol-preserved museum specimens (supplementary table S1, Supplementary Material online). We selected representatives of all Carabinae genera, as well as most subgenera of Calosoma, to test Jeannel’s (1940) hypothesis that Aplothorax is nested within Calosoma, and to establish a robust backbone within that genus. Since one of our aims was to place the St Helena endemic Aplothorax burchellii (see Lorenz 2021 for reference on the correct spelling of this species) within the phylogeny of Carabinae, we sampled three historical specimens of this presumably extinct species from the Natural History Museum of Geneva collection. Because the monophyly of most Calosoma subgenera is doubtful (Su et al. 2005; Toussaint and Gillett 2018; Sota et al. 2020), we also selected representatives of all the main clades inferred for that genus by Toussaint and Gillett (2018).
RNA Extraction and Probe Preparation
The full HyRAD-X protocol for exome capture as applied to the subfamily Carabinae is available in supplementary file S1, Supplementary Material online. First, RNA was extracted from the six specimens used to design the probe set with a modified version of QIAGEN's RNeasy protocol for Purification of Total RNA from Animal Tissues. Double-stranded cDNA (ds cDNA) was then synthesized, its concentration measured, and quality assessed with Fragment Analyzer (Advanced Analytical Technologies). We used ddRAD (Peterson et al. 2012) following modifications by Mastretta-Yanes et al. (2015). After testing for digestion profiles using several couples of restriction enzymes, ds cDNA was eventually sheared with MseI and PstI. Two dsDNA adapters were ligated to the digested ds cDNA, each of them harboring an overhang compatible with either MseI or PstI-overhang. One of the adapters (ligated to the PstI-overhang) comprised a T7-promoter sequence necessary for final transcription of probes into RNA. The ligation product was PCR-amplified to obtain libraries compatible with sequencing on Illumina platforms and to add a unique index specific to each specimen used for probe design. A different PCR primer (annealing to the MseI, or P5, end) with a unique 6-bp index was used for further amplification. Libraries were size-selected using PippinPrep (Sage Science) on a 2% agarose cassette (SageScience) in range mode 200–600 bp and then underwent a second round of amplification. The six individual libraries were then sequenced on a lane of an Illumina MiSeq with 300-bp reads. These six libraries were transcribed into RNA and biotinylated in a single reaction using HiScribe T7 High Yield RNA Synthesis Kit (New England Biolabs). We measured the concentrations of each of the six RNA probes in a Qubit assay (Thermo Fisher Scientific) and used these values to prepare an equimolar pool solution for the subsequent hybridization capture step.
DNA Extractions and Shotgun Library Preparation
DNA extractions of one ground-up leg from each specimen were performed using a modified OmniPrep (G-Biosciences) protocol (see supplementary material, Supplementary Material online). Because specimens of the presumably extinct A. burchellii are particularly rare in museum collections, we performed a nondestructive DNA extraction on two specimens. The purified DNA was quantified, and its quality assessed with Fragment Analyzer. Shearing with Fragmentase (New England Biolabs) was performed for fresh samples before library preparation. We used a modified version of the protocol used in Suchan et al. (2016) for the preparation of shotgun libraries, based upon Tin et al. (2014) and Meyer and Kircher (2010), who employed methods specific to library preparation from single-stranded DNA of museum specimens and adapted to the preparation of multiplexed libraries, respectively. Purified DNA was first phosphorylated with T4 Polynucleotide Kinase, then heat-denatured and quickly chilled on an ice and water mix. G-tailing was performed with Terminal Transferase and second strand DNA was synthesized with Klenow Fragment (3ʹ->5ʹexo-) using a poly-C oligonucleotide. Blunt-end reaction was performed with T4 DNA Polymerase and barcoded adapters were ligated to the phosphorylated end (opposite from the poly-C end). After adapter fill-in with Bst DNA Polymerase (Large Fragment), two PCR replicates were run independently using Phusion U Hot Start DNA Polymerase (Thermo Scientific) and indexed PCR primers. The two PCR shotgun libraries replicates were pooled together, purified, and quantified in a Picogreen assay (Invitrogen). Libraries were pooled in equimolar quantities based upon their respective concentrations.
Hybridization Capture and Sequencing
Hybridization capture for enrichment of shotgun libraries was based on the MYbaits protocol (Arbor Biosciences), but with a two-step capture at different temperatures as suggested by Li et al. (2013). Here, we used an inverse-touchdown approach, with a first capture at 50 °C, whose product was then captured a second time at 65 °C to improve the stringency of the reaction (Orlando L, personal communication). Two independent library sequencing runs were performed on Illumina NovaSeq SP (Fasteris, Switzerland) and Illumina Hiseq2500 sequencers, using a paired-end protocol on rapid run mode (Lausanne Genomic Technologies Facility, Switzerland).
phyloHyRAD Pipeline
The first step of the phyloHyRAD pipeline (https://github.com/JeremyLGauthier/PHyRAD/tree/master/phyloHyRAD) (fig. 1) is the construction of a loci reference catalog. The sequence pairs generated from the ddRAD probe libraries were cleaned using AdapterRemovalv2 (Schubert et al. 2016) and Cutadapt (Martin 2011) to remove adaptors, bases with a quality score lower than 20 and reads smaller than 30 bp. Read quality was first checked using FastQC (Babraham Institute) before loci construction was performed using ipyrad (Eaton and Overcast 2020) with a minimum depth of 6 and a clustering threshold of 0.70 (following testing with values 0.70, 0.80, and 0.90). Ultimately, loci shared by at least two probes were retained in a reference catalog, which was evaluated for contamination using the metagenomic sequence classifier Centrifuge (Kim et al. 2016). Contrary to UCE analyses using phyluce (Faircloth 2016), the phyloHyRAD pipeline does not perform individual loci assembly from historical samples but capitalizes on the sequencing of probe sequences to generate a reliable catalog. During this step, ipyrad considers and removes putative paralogous loci if more than 50% of the shared SNP are heterozygous, parameter “max_shared_Hs_locus.” Reads from each historical sample were trimmed and cleaned using Cutadapt (Martin 2011) to remove barcodes, adaptors, bases with a quality lower than 20, and reads smaller than 30 bp. Terminal poly-Cs were removed using a custom Perl script (DropBpFastq_polyC.pl) before read quality was checked using FastQC.
Cleaned reads from each historical sample were individually mapped onto the loci catalog generated above using both a strict algorithm, BWA-ALN (Li and Durbin 2009), and a more relaxed one, BWA-MEM (Li 2013). Indels were realigned using the GATK IndelRealigner (McKenna et al. 2010) and PCR duplicates were removed using MarkDuplicates from the Picard toolkit (http://broadinstitute.github.io/picard). Nucleotide mis-incorporation patterns were investigated using MapDamage2.0 (Jónsson et al. 2013), and base quality scores were rescaled according to their probability of representing a postmortem DNA deamination event, to reduce the impact of DNA decay on downstream analyses. Finally, individual consensus sequences were generated for each locus using the following scripts from samtools suite: mpileup, bcftools, and vcfutils.pl (Li et al. 2009). Consensus sequences were cleaned using seqtk (https://github.com/lh3/seqtk) to retain bases with a phred quality > 30. Cleaned consensus sequences were combined using a custom script and aligned with MAFFT using the –auto option to automatically select an appropriate strategy for the alignment (Katoh and Standley 2013).
Integration of External Data
Raw reads from eight additional taxa were gathered from the SRA database and other online repositories to test the compatibility of HyRAD-X data with established phylogenomic data. Specifically, we collected raw reads from the following species: Calosoma (Calosoma) frigidum Kirby, 1837 (transcriptome, SRR2083640), Carabus (Carabus) granulatus L., 1758 (transcriptome, SRR596983), Carabus (Ohomopterus) iwawakianus (Nakane, 1953) (transcriptome, DRR089198), Carabus (Tanaocarabus) taedatus (UCEs, SRR10334070), Carabus (Ohomopterus) uenoi Ishikawa, 1960 (transcriptome, DRR089202), Carabus (Megodontus) violaceus L., 1758 (transcriptome, SRR10675209), Elaphrus aureus Müller, 1821 (Coleoptera, Carabidae, Elaphrinae) (transcriptome, SRR2083660) and Pasimachus viridans LeConte, 1858 (Coleoptera, Carabidae, Scaritinae) (transcriptome, https://doi.org/10.5061/dryad.8w9ghx3h9). These raw reads were integrated directly into the phyloHyRAD pipeline at the mapping step and treated in the same manner as the other samples, apart from omitting the MapDamage step because these data derived from nondegraded fresh samples.
Alignment Post-treatment
All assembled loci recovered from the relaxed algorithm were further cleaned in Geneious R11 (Biomatters) to remove short or problematic sequences. We relied on four main nucleotide data sets to infer phylogenetic relationships among Carabinae lineages. The first data set was composed of 2,945 loci processed using the relaxed BWA-MEM algorithm (see above) and 43 taxa (Data set A). Several taxa had a very low number of loci and were removed to prevent spurious phylogenetic inferences (i.e., specimens with <5 kb of data). A second data set was derived from the Data set A, with the exception that taxa represented by less than 20 kb of data were pruned, resulting in a final matrix of 40 taxa (Data set B). The third data set was produced using all recovered and clean loci from the strict algorithm, BWA-ALN, for a total of 1,227 loci and 40 taxa (Data set C). A fourth data set was generated with the loci selected for the BEAST analyses (Data set D, see below) to perform the divergence time analyses. For each data set, loci were then concatenated using AMAS (Borowiec 2016) to produce global alignments and partition files. Evaluation of the capture efficiency and its influence on the final data sets was performed by estimating the proportion of parsimony-informative sites across the entire alignments using AMAS and for each sample using FASconCAT-G (Kück and Longo 2014). Complementary alignment statistics have been estimated on each data set using Alistat (Wong et al. 2020). The phylogenetic distance between historical samples and probes was estimated using distTips from the adephylo package (Jombart et al. 2010). To investigate the impact of sample history on data recovery, relative sequencing depth (i.e., number of reads for each sample divided by the number of reads in the library), number of loci for each method and sample characteristics were analyzed and Spearman’s correlations were performed and plotted using R (R Project).
Recovery of Mitochondrial Genes from HyRAD-X Data
Mitochondrial sequences from the cleaned reads of each historical sample were assembled using MitoFinder (Allio et al. 2020), using the following publicly available mitochondrial sequences from GenBank as references: Calosoma sp. (GU176340), C. granulatus (MN122870), S. buparius (MF497822), and Tetracha sp. (MG253284). We thereafter built an additional data set (Data set E), combining these new mitogenomic data with the mitogenome data set created by Sota et al. (2020), resulting in a data matrix consisting of a total of 60 taxa, 14 loci (all protein-coding mitochondrial genes in addition to the 16S rRNA locus), corresponding to ca. 12 kbp.
Phylogenetic Inference
All phylogenetic analyses using the different nucleotide alignments were performed in IQ-TREE 2.0.5 (Minh et al. 2020) using the edge-linked partition model (Chernomor et al. 2016). Because simultaneously estimating the best models of nucleotide substitution and partitioning schemes was too computationally demanding, we used PartitionFinder 2.1.1 to first estimate the best partitioning schemes a priori, beginning with one partition per locus for all data sets. To conduct tractable analyses on a local cluster and on the CIPRES Science Gateway cluster (Miller et al. 2010), we used the rcluster algorithm under the Akaike Information Criterion corrected (AICc), with rcluster-max = 1,000 and rcluster-percent = 20. The resulting partitioning schemes were then used in IQ-TREE to select corresponding models of nucleotide substitution using ModelFinder (Kalyaanamoorthy et al. 2017) and the AICc across all available models in IQ-TREE. To avoid local optima, we performed 100 independent tree searches for each data set in IQ-TREE with default options. To estimate branch support, we calculated 1,000 ultrafast bootstraps along with 1,000 SH-aLRT tests in IQ-TREE (Guindon et al. 2010; Hoang et al. 2018). We used the hill-climbing nearest-neighbor interchange topology search strategy to avoid severe model violations leading to biased ultrafast bootstrap estimations (Hoang et al. 2018). We also conducted a separate analysis of the preferred Data set B (see Results) with 100 regular bootstraps to assess branch support. The best tree for each analysis was selected based on the comparison of maximum likelihood scores. Finally, we performed a RogueNaRok analysis (Aberer et al. 2013) to detect potential rogue taxa for each data set using the best-scoring ML tree and the 1,000 ultrafast bootstrap with default options.
Divergence Time Estimation
Divergence time estimation using RAD-sequencing data is complex not only because of the large number and short lengths of recovered RAD loci but also because of the large percentage of inherent missing data. Downsizing the initial data set is therefore necessary to facilitate divergence time computation in a probabilistic framework. To generate a data set that would be tractable for Bayesian inference of divergence times using relaxed clocks, we first excluded all loci not represented in at least 20 taxa and lacking a minimum length of 200 bp. From the initial set of 2,945 loci (BWA-MEM), this first limiting step resulted in a pool of 558 candidate loci. We then estimated phylogenetic trees for each resulting locus using IQ-TREE and a model of nucleotide substitution selected using ModelFinder. These phylogenetic trees and the best scoring tree from the species tree ML inference conducted with Data set A (see above) were used to conduct a gene shopping approach as developed by Smith et al. (2018) in SortaDate, using some underlying UNIX code and programs implemented in phyx (Brown et al. 2017). RAD loci were filtered using the following three criteria: 1) clock-likeness, 2) tree length, and 3) least topological conflict with the species tree. We selected the 100 best scoring RAD loci based on this filtering and concatenated them to produce the final alignment used in the divergence time estimation.
The Carabinae fossil record is relatively scarce. To calibrate the relaxed clocks implemented in BEAST, we relied on a selection of both fossil and secondary calibration while carefully avoiding circularity (i.e., the retained fossils were not used to obtain the ages that were in turn used as secondary calibrations). We used two fossils of Carabinae to calibrate the clocks using soft exponential prior distributions. Firstly, we used †Calosoma agassizi Barthélemy-Lapommeraye 1846 (ca. 27.8-33.9 Ma, Rupelian) from France (Nel 1988), hitherto surprisingly overlooked despite a series of well-preserved specimens. This fossil cannot be placed with certainty within Calosoma and is therefore used here as a crown calibration considering the comprehensive sampling of Calosoma in this study. An older and recently described fossil possibly belonging to Calosoma from the Vic Formation (ca. 33.9–37.8 Ma, Priabonian) in Spain (Farrés and Altimiras 2012), was not used because it did not provide a significantly different calibration and was less reliable based on morphological grounds. Secondly, we used Carabus (Ohomopterus) sp. from the Tatsumi-tôge formation in Tottori Prefecture, Japan (Hiura 1971); the fossil elytra have already been used by Deuve et al. (2012) to constrain the stem of Carabus (Ohomopterus) Reitter, 1896, and therefore we used the same placement in our analyses. Other good Calosoma and Carabus fossils exist (e.g., Deuve 1998; Yahiro et al. 2018; Kirejtshuk et al. 2019) but were not relevant considering the above-mentioned fossils and the taxon sampling in our final phylogeny (see Results). The three other tribes, Ceroglossini, Cychrini and Pamborini are not represented by any reliable usable fossils described to date, with many older descriptions being inaccurate (Nel A, personal communication). To avoid divergence time estimation biases, we included two secondary calibrations derived from the recent dated transcriptomic tree of beetles constructed by McKenna et al. (2019). Specifically, we used the crown ages of Cicindelidae+Carabidae (158 Ma, 95% CI = 138–184 Ma) and of Carabidae (121 Ma, 95% CI = 97–143 Ma) to constrain the two corresponding nodes in our topology with lognormal priors spanning the 95% credibility intervals of the estimates from McKenna et al. (2019). Importantly, we therefore avoided circularity by selecting secondary calibration from a study (McKenna et al. 2019) that did not use the two fossils we used to estimate divergence times.
All analyses were performed in BEAST 1.10.4 (Suchard et al. 2018). The best partitioning scheme and models of substitution were determined with PartitionFinder2 (Lanfear et al. 2017) using the rclusterf algorithm with parameters rcluster-max = 1,000, rcluster-percent = 20 and min-subset-size = 200, and the Bayesian Information Criterion algorithm to select between competing models. Because this algorithm relies on only three models (GTR, GTR+G, and GTR+I + G), we re-estimated a posteriori the best models using all those included in BEAST. The data set was partitioned a priori by locus for a total of 100 initial partitions. We implemented clock partitioning by conducting analyses with 1) a single clock for all partitions and 2) one clock for each partition (seven in total, see Results). We assigned a Bayesian lognormal relaxed clock model to the different clock partitions. We also tested different tree models by using a Yule (pure birth) or a birth–death model. The analyses consisted of 100 million generations with parameter and tree sampling every 5,000 generations. We estimated marginal likelihood estimates (MLE) for each analysis using path-sampling and stepping–stone sampling (Baele et al. 2012), with 1,000 path steps, and chains running for 1 million generations with a log-likelihood sampling every 1,000 cycles. The Maximum Clade Credibility tree of each analysis with median divergence age estimates were generated in TreeAnnotator 1.10.4 (Suchard et al. 2018) after removing the first 25 million generations as burn-in.
Supplementary Material
Supplementary data are available at Genome Biology and Evolution online.
Acknowledgments
The authors are indebted to David Seldon (University of Auckland, Auckland, New Zealand) and Geoff Monteith (Queensland Museum, Brisbane, Australia) for sending critical specimens used in this study. They thank Louis Toussaint (Bazemont, France) for assistance in fieldwork, and Michael Balke (ZSM-SNSB, Munich, Germany) for sending samples of specimens housed in the ZSM collections. They also warmly thank André Nel (MNHN, Paris, France) for discussions and details on the fossil fauna of Carabinae ground beetles. They thank Lily Kumpe for permission to reproduce one of her photographs. They also thank four anonymous reviewers and the editor Mario dos Reis Barros for insightful suggestions. The authors declare that there is no conflict of interest.
Data Availability
The data underlying this article (final alignments, trees, and input files) are available at: https://doi.org/10.5281/zenodo.4748993. Raw reads are available on the SRA repository BioProject ID PRJNA706141. The whole pipeline including custom scripts are available at the Github repository https://github.com/JeremyLGauthier/PHyRAD/tree/master/phyloHyRAD.
Literature Cited
Prüser F, Mossakowski D. 1998. Conflicts in phylogenetic relationships and dispersal history of the subtribe Carabitae (Coleoptera: Carabidae). In: Ball GE, Casale A, Vigna Taglianti A, editors. Phylogeny and classification of Caraboidea (Coleoptera: Adephaga). Torino: Museo Regionale di Scienze Naturali. p. 297–328.
Author notes
Emmanuel F A Toussaint and Jérémy Gauthier are joint first authors.