Caecilian Genomes Reveal the Molecular Basis of Adaptation and Convergent Evolution of Limblessness in Snakes and Caecilians

Abstract We present genome sequences for the caecilians Geotrypetes seraphini (3.8 Gb) and Microcaecilia unicolor (4.7 Gb), representatives of a limbless, mostly soil-dwelling amphibian clade with reduced eyes, and unique putatively chemosensory tentacles. More than 69% of both genomes are composed of repeats, with retrotransposons being the most abundant. We identify 1,150 orthogroups that are unique to caecilians and enriched for functions in olfaction and detection of chemical signals. There are 379 orthogroups with signatures of positive selection on caecilian lineages with roles in organ development and morphogenesis, sensory perception, and immunity amongst others. We discover that caecilian genomes are missing the zone of polarizing activity regulatorysequence (ZRS) enhancer of Sonic Hedgehog which is also mutated in snakes. In vivo deletions have shown ZRS is required for limb development in mice, thus, revealing a shared molecular target implicated in the independent evolution of limblessness in snakes and caecilians.


Introduction
Living amphibians, frogs, salamanders, and caecilians, have diverged since the Triassic.They, or their ancestors, survived all mass extinctions including the Permian-Triassic which obliterated most terrestrial vertebrates (Wake and Vredenburg 2008).Our current extinction crisis places amphibians amongst the most threatened of the vertebrate groups (Blaustein and Wake 1990).In addition, the large and highly repetitive genomes typical in amphibia pose some of the greatest challenges for vertebrate genomics (Funk et al. 2018;Nowoshilow et al. 2018).Undoubtedly, reference quality genomes for amphibia will be important in addressing key aspects of their conservation, disease ecology and evolution, and breeding programs.
Caecilians (Gymnophiona) are the deepest diverging of the three extant amphibian orders and the sister group of the frogs and salamanders (Batrachia), diverging perhaps more than 300 million years ago (Siu-Ting et al. 2019).Compared to batrachians, caecilians are few in number (approximately 215 species).With mostly secretive burrowing lifestyles and restricted distributions in the wet tropics west of Wallace's line, they are relatively seldom encountered and often considered to be the least wellknown group of tetrapods (Wilkinson 2012).
Caecilians are highly distinctive in their elongate (from 10 to 2 month adult lengths), and externally segmented snake-or worm-like form.Living species lack any trace of limbs or girdles, have skulls that are comparatively heavily ossified compared to batrachians, and have very short tails or no tails at all, all features that are associated with the fossorial or burrowing habits of adults.Eyes are also greatly reduced with any loss of vision seemingly compensated for by a putative chemosensory pair of tentacles on the snout that are not found in any other taxa (Taylor 1968;Wilkinson 2012).Other unique features include a dual-jaw closing mechanism, a copulatory organ formed from the hind part of the gut (phallodeum), and persistent Mullerian ducts in males.Their scientific name Gymnophiona means "naked snakes" reflecting their perceived affinity to snakes albeit without scales.Ironically, some caecilians do have subdermal scales (quite different from the external scales of squamates) concealed in pockets or folds in the skin and are the only living amphibians to have scales.Like most other amphibians, caecilians are generalist predators as adults (Measey et al. 2004;Wilkinson 2012).
As with other living amphibians, oviparity with an aquatic larval stage and metamorphosis to a terrestrial adult is the ancestral reproductive mode within the group.Clutches of relatively few eggs are laid on land rather than in water, entailing a migration to water for any hatchling-larvae, and are invariably guarded until hatching by attending mothers.Other reproductive strategies include oviparity with direct terrestrial development and viviparity.Foetuses of at least some viviparous caecilians are believed to use specialized teeth to feed on the hypertrophied and lipidified oviduct linings of their mothers and it was discovered that in some oviparous direct developers, their hatchlings feed on the similarly modified maternal epidermis with similarly specialized vernal teeth (Kupfer et al. 2006;Wilkinson et al. 2013).Caecilian diversity is far from completely known and most of the described species are data deficient in the International Union for Conservation of Nature (IUCN) red list and thus, lack any assessment of their conservation status and threats.New higher taxa (families and genera) have been recently discovered and caecilian species are described every year (Kamei et al. 2012;Wilkinson et al. 2021).Although many aspects of caecilian biology remain to be adequately investigated, phylogenetic relationships of the ten currently recognized families are reasonably well-established, and support the generally accepted idea that caecilians are an ancient Gondwanan group with relatively recent and limited dispersals into Central America and South East Asia (Gower et al. 2002;Kamei et al. 2012).
The Rhinatrematidae, the deepest diverging (c.125 million year ago) of the ten caecilian families (Wilkinson et al. 2011), is represented by the only previously published caecilian genome Rhinatremata bivittatum, which is 5.3 Gb in size and was sequenced by the vertebrate genomes project (VGP) (Rhie et al. 2021).Here we provide reference quality genomes for two additional caecilian genomes, Geotrypetes seraphini (3.8 Gb) and Microcaecilia unicolor (4.7 Gb), and describe molecular level insights gleaned from their comparison with other vertebrate genomes.

Reference Genomes
The reference genomes of G. seraphini (Dermopdiidae) and M. unicolor (Siphonopidae) were assembled using four data types including Pacbio continuous long reads (CLR) and Hi-C reads, 10 × Chromium linked-reads, and BioNano optical maps (supplementary table S1, Supplementary Material online) and meet the VGP's 6.7.P5.Q40.C90 metric standards, the same used previously for Rhinatrema bivittatum and other vertebrates (Rhie et al. 2021).G. seraphini and M. unicolor, respectively, presented: contig N50 of 20.6 Mb and 3.6 Mb; scaffold N50 of 272 Mb and 376 Mb; and, Phred-scaled base accuracy Q43 and Q37 with 99% and 97% of sequences assigned to 19 and 14 chromosomes (table 1).Chromosomal units were identified and named by size (fig.1).The final assembly sizes were 3.8 Gb and 4.7 Gb, respectively (table 1).Manual curation was performed as in Howe et al. (2021) (supplementary fig.S1, Supplementary Material online) resulting in 69 and 55 removals of misjoins, 122 and 84 new joins, and 18 and 0 removals of false duplications for G. seraphini and M. unicolor, respectively.
A synteny analyses performed with single-copy Benchmarking Universal Single-Copy Orthologs (BUSCO) genes shows that chromosome content and gene order are conserved to a remarkable extent across caecilian chromosomes, with large blocks of collinear synteny up to chromosome-scale further conserved to anurans (common frog and toad) across more than 600 million years of evolution (fig.2).

Repeat Content
Substantial proportions of the caecilian genomes were found to consist of repeats: a total of 67.7%, 72.5%, and 69.3% for R. bivittatum (Rhie et al. 2021), G. seraphini and M. unicolor, respectively (supplementary table S2, Supplementary Material online).Class I transposable elements (TEs; retrotransposons) are ∼20 times more abundant (in base pairs) than Class II TEs (DNA transposons) and make up more than 30% of each caecilian genome.Long interspersed elements (LINEs) are the most abundant transposon type, followed by dictyostelium intermediate repeat sequences (DIRSs), that is tyrosine recombinase retroelements.These relative proportions differ from those found in the large genomes of other amphibians including caecilians; for example, a genomic low-coverage shotgun analysis of the caecilian Ichthyophis bannanicus (genome size 12.2 Gb) revealed more DIRSs than LINEs (Wang et al. 2021), while published salamander genomes are dominated by long terminal repeat (LTR) elements, with DIRSs never surpassing 7% of their content (Sun and Mueller 2014;Nowoshilow et al. 2018).These findings bolster the concept that repeated extreme TE accumulation in amphibians is not resulting from failure to control a specific type of TE (Wang et al. 2021).Caecilian Genomes Reveal the Molecular Basis of Adaptation • https://doi.org/10.1093/molbev/msad102MBE and LTRs/endogenous retrovirus [ERVs]) (Najafabadi et al. 2015).The emergence of novel gene families with these functional capacities at the origin of caecilians may have contributed to the unique pattern of TE accumulation we observe in this group; further work is needed.

Gene Family Analyses
We performed a gene birth and death analysis using CAFE v5 (Mendes et al. 2020) on the remaining 13,541 orthogroups, examining the ancestral and extant caecilian nodes where possible.The majority of these (10,035) orthogroups had no net change in gene family size between caecilian species and the ancestral amphibian node (8,065 orthogroups) or had insufficient sampling (1,970 orthogroups), and were excluded from further analysis.We reconstructed ancestral states for the remaining 3,506 orthogroups (supplementary table S5, Supplementary Material online).There were 156 orthogroups that were completely absent in G. seraphini and M. unicolor (most likely lost in their most recent common ancestor) (supplementary table S3, Supplementary Material online).Only 13 orthogroups showed significant changes in the number of caecilians (fig.3, supplementary table S6, Supplementary Material online), with five expansions at the ancestral caecilian node (ACN), and three at the internal caecilian node (ICN), of which one gene family is significantly expanded at both nodes.There are a total of three gene families with significant contractions, all of which are on the ACN.The gene families displaying significant expansions are: cytochrome P450 family 2 (ACN), these monooxygenases catalyze many reactions involved in the metabolism of a large number of xenobiotics and endogenous compounds (Manikandan and Nagini 2018); butyrophilin (BTN) family (ACN), involved in milk lipid secretion in lactation and regulation of the immune response (Afrache et al. 2012); tripartite motif (TRIM) family (ACN and ICN) involved in a broad range of biological processes that are associated with innate immunity (Ozato et al. 2008); and H2A and H2B histones (ICN), which

MBE
together with H3 and H4 histones and DNA form a nucleosome (Koyama and Kurumizaka 2018).In contrast, while immune function-related BTN and TRIM families have significant expansions at the ACN and/or ICN, both immunoglobulin heavy and light variable gene families have significant contractions at the ACN.The final gene family displaying significant contractions is gamma crystallin, a structural protein found largely in the nuclear region of the lens of the eye at very high concentrations (Vendra et al. 2016).Changes in these gene family repertoires may have contributed to the transition to a fossorial lifestyle and the packaging of a large genome.

Identification of Genes With Signatures of Positive Selection
Variation in selective pressure was assessed using codonbased models of evolution to assess changes in dN/dS across sites and lineages as implemented in codeml in the PAML package (Yang 2007).All 1,935 gene families that reached our criteria (see Materials and Methods) were analyzed.These 1,935 gene families were functionally enriched for Gene Ontology (GO) terms "extracellular structure organization", "developmental process", "regulation of biological process", "response to stress", "cell communication", "signal transduction", "regulation of signaling", and "leukocyte differentiation" (supplementary table S7, Supplementary Material online).The lineages specified as foreground were the branch leading to the extant caecilians (ACN), and all terminal and internal branches within the caecilian clade.The selective pressures, that is positive or negative selection or neutral evolution, were estimated for each gene within each foreground lineage and were compared to all other vertebrates (background lineages) in the alignment.Here we report the signatures of positive selection (dN/dS > 1) identified in homologs on the foreground (i.e., caecilian) lineages.After Bonferroni correction, we detected 379 orthologous families with evidence of caecilian MBE lineage-specific positive selective pressure (supplementary table S8, Supplementary Material online).We did not find any statistical enrichment for GO functions in the genes under positive selection on the nodes tested.Examples of genes with signatures of positive selection on the ACN are FBN1 (under positive selection on both the ACN and the ICN), AGTPBP1, and CEP290 all of which are involved in eye morphogenesis (Chakrabarti et al. 2006;Sheck et al. 2018;Stephenson et al. 2020).Genes with signatures of positive selection on the ICN include: Thrombomodulin (THBD) (involved in the reduction of thrombin), Wnt ligand secretion mediator (WLS) (enables Wnt-protein binding activity and is involved in several processes, including animal organ development; mesoderm formation; and positive regulation of canonical Wnt signaling pathway), and CD8A (mediates efficient cell-cell interactions within the immune system).
In  (Olsen 1997;He and Karsdal 2016).In summary, whilst the biological processes and functions of the genes under positive selection are not significantly enriched, there are several genes implicated in organ (especially eye) development and morphogenesis.Caecilian tentacles can be considered as compensation for reduced vision through enhanced olfaction, and they are thought to be materially related (by transformation) to components of the visual system such as eyelids and lacrimal ducts (Billo and Wake 1987).Therefore, a tentative explanation for the positive selection we observe on genes associated with organ (eye) development and morphogenesis is the origin of the tentacles from ancestral visual components.
The approach we have taken in our analysis of selective pressure variation is necessarily stringent, and, therefore, not a complete assessment of the entire genome where there are likely many other processes at work.

Analysis of ZRS Enhancer Loss
Some key enhancers for developmental regulator genes are very strongly conserved at the sequence level across all vertebrates.For example, the I12a enhancer element, located between homeobox genes Dlx1 and Dlx2, is known to be conserved from bony fish to mice (Plessy et al. 2005).
Analysis of the ortholog of the l12a enhancer across the 22 vertebrate species confirms that it is easily identifiable and conserved in all vertebrates, including the three caecilians (fig.3).Similarly, the ZRS enhancer element for the Sonic hedgehog gene (Shh), which is located within an intron of the LMBR1 gene, is almost ubiquitously conserved in vertebrates.However, snakes contain a mutant form of ZRS that when placed into mice produces a "serpentised" phenotype, directly implicating loss of ZRS function in vertebrate limblessness (Kvon et al. 2016).From the fossil record, we know that snake limblessness pre-dates that of limbless lizards, also reflected in a higher level of divergence in limb regulatory elements in snakes in comparison to limbless lizards.Indeed, ZRS is intact in limbless lizards where more complex and lineage-specific routes to limblessness have been proposed (Roscito et al. 2022).Here we show that the conserved ZRS element is absent (or mutated beyond recognition) in the three caecilian genomes.Specifically, there is no trace of homology by sequence matching (fig.3), and a conserved ETS1 binding site within the ZRS enhancer element, which has been shown to be critical for limb development in mouse and is missing in snakes (Lettice et al. 2012;Kvon et al. 2016), is also entirely missing in caecilians (supplementary fig.S2, Supplementary Material online).Combined with the functional work on the mutated form of the snake ZRS, this may provide us with a potential common molecular target implicated in the convergent loss of limbs in snakes and caecilians.Alternatively, the observed pattern of loss of the ZRS element in caecilians could be secondary to the loss of limbs.Similar to the situation in lizards (Roscito et al. 2022), the loss of limbs in caecilians could have been piecewise and relaxation of selective pressures on the ZRS region could have resulted in its eventual loss from caecilian genomes.Functional analysis will be needed to finally resolve the history of limb loss in this major amphibian group.

Sample Preparation and Genome Assembly
Genome sequences were produced from wild-caught animals that had been maintained in captivity for several years.Specimens are at the Natural History Museum, London cataloged under their unique field tags: G. seraphini (MW11051) from Kon, Cameroon, and R. bivittatum A further 5 SMRTcells of G. seraphini were sequenced with S/P3-C1/5.0-8M sequencing chemistry on a Pacific Biosciences Sequel II machine.The Hi-C libraries were created with a Dovetail Hi-C kit for G. seraphini and an Arima Genomics kit (version 1) for M. unicolor and sequenced on an Illumina HiSeq X.A 10 × Genomics Chromium machine was used to create the linked-read libraries which were sequenced on an Illumina HiSeq X. Optical maps were created for both species using a Bionano Saphyr instrument.Raw reads statistics and data access links are available in supplementary table S1, Supplementary Material online.Assembly for G. seraphini and M. unicolor was conducted mainly for R. bivittatum as described in (Rhie et al. 2021) using four data types and the VGP assembly pipeline (version 1.6 for G. seraphini and version 1.5 for M. unicolor; supplementary fig.S1, Supplementary Material online).In brief, the Pacific Biosciences CLR data for each species was input to the diploid-aware longread assembler FALCON and its haplotype-resolving tool FALCON-UNZIP (Chin et al. 2016).The resulting primary and alternate assemblies of M. unicolor were input to Purge Haplotigs (Roach et al. 2018) and G. seraphini assemblies were input to Purge_dups (Guan et al. 2020) for identification and removal of remaining haplotigs.Both species' primary assemblies were subject to two rounds of scaffolding using 10 × long molecule linked-reads and Scaff10 × (https://github.com/wtsi-hpag/Scaff10X),and one round of Bionano Hybrid-scaffolding with preassembled Cmaps from 1-enzyme non-nicking (direct labelling enzyme [DLE]-1) and the Solve Pipeline.The resulting scaffolds were then further scaffolded into chromosome-scale scaffolds using the Dovetail/Arima library Hi-C data for G. seraphini/M.unicolor and SALSA2 (Ghurye et al. 2019).The scaffolded primary assemblies plus the Falcon-phased haplotigs were then subjected to Arrow (Chin et al. 2013) polishing with the Pacbio reads and two rounds of short read polishing using the 10 × Chromium linked-reads, longranger align (Bishara et al. 2015), freebayes (Garrison and Marth 2012) and consensus calling with bcftools (Danecek et al. 2021) (further details available in Rhie et al 2021, and supplementary fig.S1, caecilian representation in the following ways: (1) to identify orthogroups that lack representation across all amphibia: we identified orthogroups that contained at least two fish species and two tetrapod (nonamphibian) species-totalling 265 orthogroups, (2) to identify orthogroups that are absent only in caecilians: we extracted those orthogroups with least two fish species and two tetrapod species (including at least one frog species)-totalling 238 orthogroups, (3) to identify orthogroups that are present across amphibia and amniota but absent in caecilians: we extracted orthogroups containing two frog species and two amniota speciestotalling 22 orthogroups.Orthogroups that did not contain caecilian sequences and that did not satisfy these filters were set aside.Combining the set of orthogroups that contain caecilian representatives (13,541) plus those that passed our filters 1-3 above (525), produced our final set of 14,066 orthogroups for analysis in CAFE v5 with Poisson distribution option and the lambda parameter (rate of change of evolution) estimated for each species (Mendes et al. 2020)

Analysis of Selective Pressure Variation
Our selective pressure variation analysis focussed on 3,236 single-copy orthogroups (single gene ortholog [SGOs]) and 9,690 multicopy genes (MCGs) from our orthogroups.The 9,690 MCGs obtained from the CAFE analysis, could be further broken down into SGO clusters as follows: 3,464 contained species-specific duplications in a single lineage, and were designated SGOs by removal of the single lineage containing the duplications; the remaining 6,226 were divided into their constituent single-copy paralogous groups using UPhO (Ballesteros and Hormiga 2016).Species-specific gene duplications that were not specific to caecilians were removed.In total, this provided 14,807 SGOs (3,236 original SGOs plus 11,571 SGOs generated from MCGs) for further analysis.We used three different alignment methods on the amino-acid sequences for these SGOs (i.e., MAFFT (Rozewicki et al. 2019) (with -auto option), MUSCLE (Edgar 2004), and Prank (Löytynoja 2014) (with -nobppa option)), and used MetAl (Blackburne and Whelan 2012) to assess the statistical significance of the resultant alignments.If the difference between alignments was ≥5%, the alignment with the highest NorMD (Thompson et al. 2001) score was used.The corresponding gene trees were reconstructed using IQtree (Nguyen et al. 2015) (with 100 bootstraps for each tree and models of best fit selected on a gene-by-gene basis).Robinson-Foulds distances between each of the gene trees generated and the canonical species tree were estimated using Clann (Creevey and McInerney 2005), and only those gene trees with zero distance were retained for further analysis, that is the gene and species tree were required to be in full agreement thus, minimizing the risk of hidden paralogy in our single-copy gene orthogroups (SGOs).It has been shown that codeml provides more accurate predictions when a minimum of seven species are present in the dataset (Anisimova et al. 2002), gene families that did not meet this criterion were not considered for selective pressure variation analysis.We assessed the patterns of selective pressure variation on the remaining 1,935 SGOs using codon-based models of evolution in codeml (Yang 2007) using our pipeline for large-scale analyses "Vespasian" (Constantinides et al. 2021).The models we employed are a set of standard nested models which are automatically compared by Vespasian using likelihood ratio tests with significance calculated using the appropriate degrees of freedom.The models used were the neutral model M1Neutral, its lineage-specific extensions model A, and the null model for model A. M1Neutral allows two site classes for dN/dS (referred to as ω throughout): ω0 = 0 and ω1 = 1.Model A assumes the two site classes are the same in both foreground and background lineages (ω0 = 0 and ω1 = 1) and ω2 for the foreground is estimated from the data and free to vary above 1.Model A null estimates a ω2 value also, but here it is restricted to below 1 thus, allowing sites to be evolving under either purifying selection or to be neutrally evolving but not permitting positive selection.Sequences were considered to exhibit lineage-specific selective pressure if the likelihood ratio test for ModelA is significant in comparison to both ModelA null and M1Neutral.All alignments (codon-based and amino-acid) for the selective pressure analyses are available at DOI:10.5281/zenodo.7540729.The GO terms were predicted for all caecilian CDSs using eggNOG with "orthology restrictions" option set to "transfer annotations from one-to-one orthology only" (eggnogmapper.embl.de)(Huerta-Cepas et al. 2019) and all other parameters as default.GO term enrichment analysis was carried out using goatools (Klopfenstein et al. 2018) with Taxonomic Scope auto-adjusted per query.

Comparative Analysis of the ZRS Enhancer
The ZRS enhancer sequence is located within an intron between exons 5 and 6 of the mouse LMBR1 gene sequence (Gene ID: 105804842) (Kvon et al. 2016).The LMBR1 sequence was extracted from the genomes of each species in our sample set (supplementary table S11, Supplementary Material online) and the homologous intron sequence containing the ZRS sequence was identified across all species.Using BLASTn (Camacho et al. 2009) the ZRS region was readily identifiable across all 22 noncaecilian species (fig.3, and supplementary fig.S2, Supplementary Material online) but was not detectable in the three caecilian genomes.The ZRS sequence was also searched against the reference genome assemblies of all three caecilians (to account for possible relocation of the enhancer) and we did not identify a ZRS-like sequence in an alternative location in the caecilian genomes.Using the Ovchinnikov et al. • https://doi.org/10.1093/molbev/msad102MBE same approach, we quantified the level of sequence conservation across our set of vertebrates for an additional enhancer, I12a (AF349438.2),located between the homeobox bigene cluster paralogs DLX1 and DLX2 (supplementary table S11, Supplementary Material online).The DLX1 gene was not annotated for Crocodylus porosus, therefore, we used the region between METAP1D and DLX2.
FIG. 1. Geotrypetes seraphini and Microcaecilia unicolor genome Hi-C contact maps, respectively.The contact maps show Hi-C reads at 8.192 Mb resolution in HiGlass.The top two panels are G. seraphini and the bottom two panels are M. unicolor and, in both cases, the left panel is before and the right panel is after manual curation.Chromosomes are ordered from large (left/top) to small (right/bottom).After the VGP Assembly Pipeline and manual curation, 99.8% and 97% of sequences were assigned to 19 and 14 chromosomes for G. seraphini and M. unicolor, respectively.

FIG. 3 .
FIG. 3. Summary of sequence conservation of two enhancer elements across vertebrates (ZRS and l12a).The vertebrate species phylogeny used throughout this study is shown on the left with the significant gene gain and loss events noted on the ancestral and internal caecilian nodes (ACN and ICN), respectively.The histogram shows the level of sequence conservation identified by BLASTN for each species for two enhancers: I12a (pale shaded bars) and ZRS (dark shaded bars).Snakes and caecilians are highlighted as they independently evolved limbless morphologies.Animal images are taken from http://phylopic.org/.
(Hoshii et al. 2007;Kinzel et al. 2014)uppke et al. 2017;Sanghvi et al. 2019eages are described as follows (specific internal caecilian lineage in parenthesis): HESX1 (R. bivittatum) required for the normal development of the forebrain, eyes, and other anterior structures such as the olfactory placodes and pituitary gland(Dattani et al. 1998); NFE2L2 (G.seraphini), a transcription factor that plays a key role in the response to oxidative stress(Huang et al. 2000;Eggler et al. 2009;Huppke et al. 2017;Sanghvi et al. 2019); LGR4 (R. bivittatum) is involved in the development of the anterior segment of the eye(Siwko et al. 2013)and is required for the development of various organs, including kidney, intestine, skin, and reproductive tract(Hoshii et al. 2007;Kinzel et al. 2014); COL9A3 (M.unicolor) encodes a component of Collagen IX -a structural component of cartilage, intervertebral discs and the vitreous body of the eye . All 3,506 orthogroups showing expansions or contractions within caecilians are provided in supplementary table S5, Supplementary Material online, and orthogroups with significant expansions and contractions are detailed in supplementary table S6, Supplementary Material online.