Chromosome-level genome assembly for the Aldabra giant tortoise enables insights into the genetic health of a threatened population

Abstract Background The Aldabra giant tortoise (Aldabrachelys gigantea) is one of only two giant tortoise species left in the world. The species is endemic to Aldabra Atoll in Seychelles and is listed as Vulnerable on the International Union for Conservation of Nature Red List (v2.3) due to its limited distribution and threats posed by climate change. Genomic resources for A. gigantea are lacking, hampering conservation efforts for both wild and ex situpopulations. A high-quality genome would also open avenues to investigate the genetic basis of the species’ exceptionally long life span. Findings We produced the first chromosome-level de novo genome assembly of A. gigantea using PacBio High-Fidelity sequencing and high-throughput chromosome conformation capture. We produced a 2.37-Gbp assembly with a scaffold N50 of 148.6 Mbp and a resolution into 26 chromosomes. RNA sequencing–assisted gene model prediction identified 23,953 protein-coding genes and 1.1 Gbp of repetitive sequences. Synteny analyses among turtle genomes revealed high levels of chromosomal collinearity even among distantly related taxa. To assess the utility of the high-quality assembly for species conservation, we performed a low-coverage resequencing of 30 individuals from wild populations and two zoo individuals. Our genome-wide population structure analyses detected genetic population structure in the wild and identified the most likely origin of the zoo-housed individuals. We further identified putatively deleterious mutations to be monitored. Conclusions We establish a high-quality chromosome-level reference genome for A. gigantea and one of the most complete turtle genomes available. We show that low-coverage whole-genome resequencing, for which alignment to the reference genome is a necessity, is a powerful tool to assess the population structure of the wild population and reveal the geographic origins of ex situ individuals relevant for genetic diversity management and rewilding efforts.

improve conservation management programs [11].The recently established reference genome for one of the Galápagos giant tortoises, Chelonoidis niger abingdonii, revealed insights into potentially aging, disease, and cancer-related gene functions by analyzing gene content evolution among tortoises [12].For Aldabra giant tortoises, only short-read sequencing data is available from the same study [12].
Aldabrachelys gigantea have been successfully used in rewilding projects on several Western Indian Ocean Islands, whose endemic giant tortoise species have recently gone extinct [13].
The introduced populations act as ecological replacements for the extinct species and take a central role in shaping and sustaining large-scale vegetation dynamics as the largest frugi-and herbivore [14][15][16][17].Aldabrachelys gigantea have been introduced on the Mauritian islets of Ile Aux Aigrettes and Round Island and on the island of Rodrigues [18].Monitoring the effectiveness of these rewilding projects will be crucial to catalyzing larger projects in Madagascar [19].Aldabrachelys gigantea rewilding programs require genomic monitoring to reduce founder effects and maximize genetic variation in newly introduced populations [20].
Finally, uncertainties exist about the existence of additional Aldabrachelys lineages, as well as the number and taxonomic status of extinct lineages [10] due to weak morphological resolution and low-resolution genetic marker sets [21,22].
In this study, we present the first high-quality chromosome-level genome of A. gigantea using PacBio HiFi sequencing and Hi-C sequencing for scaffolding.We assessed the utility of the reference genome by performing low coverage whole-genome resequencing for a total of 32 tortoises.We inferred the genetic structure of the wild population and the likely origin of zoohoused individuals.

DNA extraction, PacBio library preparation, and sequencing
In December 2020, during routine veterinary blood sampling, a subsample of approximately 3 mL of whole blood was collected from a female A. gigantea (named Hermania) living in the Zurich Zoo since 1955.Because blood was subsampled during a routine veterinary blood sampling, no additional ethical approval was required.Whole blood was taken from the animal's dorsal tail vein and stored on ice in a heparin-coated blood collection tube.DNA extraction was carried out at the Genetic Diversity Center, ETH, Zurich, according to the manufacturer's instructions of MagAttract® High Molecular Weight DNA (HMW) Kit (Qiagen), with a single modification: instead of using 200 µl whole blood as suggested for blood samples with non-nucleated red blood cells, a total of 50 µl whole blood was used.The purified DNA was eluted in 200 µl molecular-grade water.Subsequent steps, including gDNA quality control, PacBio HiFi library preparation, and sequencing, were carried out at the Functional Genomic Center Zurich, ETH.
The input HMW genomic DNA concentration was measured using a Qubit Fluorometer (Thermo), and the DNA integrity was checked on a Femto Pulse Device (Agilent).The HiFi library preparation started with 14 μg HMW DNA.The PacBio HiFi library was produced using the SMRTbell® Express Template Prep Kit 2.0 (Pacific Biosciences), according to the manufacturer's instructions.Briefly, the DNA sample was mechanically sheared to an average size of 20 Kbp using a Megaruptor 3 Device (Diagenode).A Femto Pulse gDNA analysis assay (Agilent) was used to assess the resulting fragment size distribution.The sheared DNA sample was DNA damage repaired and end-repaired using polishing enzymes.PacBio sequencing adapters were ligated to the DNA template.A Blue Pippin device (Sage Science) was used to size-select fragments >15 Kbp.The size-selected library was quality inspected and quantified using a Femto Pulse gDNA analysis assay (Agilent) and a Qubit Fluorometer (Thermo), respectively.The SMRT® bell-Polymerase Complex was prepared using the Sequel® II Binding Kit 2.0 and Internal Control 1.0 (Pacific Biosciences) and sequenced on a PacBio Sequel II instrument using the Sequel II Sequencing Kit 2.0 (Pacific Biosciences).In total, two Sequel II SMRT Cells 8M (Pacific Biosciences) were run, taking one movie of 30 hours per cell.This yielded 49.4 Gbp of HiFi reads with a mean read length of 22.8 Kbp, which corresponds to approximately 20.8× coverage of the genome (Table 1).

Nuclear genome assembly, contamination scan, and evaluation
The consensus circular sequences per each Sequel II SMRT Cell (Pacific Biosciences) were filtered for adapter contamination with HiFiAdapterFilt v2.0.0 [23,24].Overall, 0.008% of the HiFi reads were filtered out.Genome size and heterozygosity rate were estimated based on the 17-mer frequency of the cleaned HiFi reads with GCE v1.0.2 (GCE, RRID: SCR_017332) [25,26].Our results indicate that A. gigantea has an estimated genome size of 2.37 Gbp (Fig. S1) and low heterozygosity of 0.072% (corresponding to 0.72 SNPs per 1 Kbp).
Scanning for contaminant contigs in the draft assembly was performed by following three approaches.First, the draft assembly was split into 5 Kbp segments using SeqKit v0.16.1 (SeqKit, RRID:SCR_018926) [34,35].Each segment was blasted against the full NCBI nonredundant protein database by running diamond v2.0.9 (DIAMOND, RRID:SCR_016071) [36,37] with the blastx option.The decision of a segment coming from an external source was given by considering the bitscore evalue and GC content, and none of the hits were considered a significant match with cut-off values of 30, 0.0001, 70%, respectively.
We assessed the completeness of the assembly based on a BUSCO analysis of single-copy orthologs v5.1.2(BUSCO, RRID:SCR_015008) [43,44] with default parameters and the sauropsid dataset (sauropsida_odb10) in the genome mode.

Hi-C sequencing and genome scaffolding
The Hi-C library was constructed with a 250 µl whole blood sample that was first fixed with 1% formaldehyde for 15 min at room temperature.Then, solid glycine powder was added to obtain a final concentration of 125 mM and incubated for 15 min at room temperature with periodic mixing.After centrifugation, the pellet was resuspended in PBS + 1% Triton-X solution and incubated at room temperature for 15 min.Then, the nuclei were collected after the mixture was spun down.The cross-linked sample was sent on dry ice to Phase Genomics (Seattle, Washington, USA) for sequencing.The Hi-C library was generated using the Phase Genomics Proximo Animal kit version 4.0.Briefly, the DNA sample was digested with DpnII and the 5'-overhangs were filled while incorporating a biotinylated nucleotide.The blunt-end fragments were ligated, sheared, and the biotinylated ligation junctions captured with streptavidin beads.The resulting fragments were sequenced on a NovaSeq 6000 150 bp pairedend run.A total of 680 million reads were produced, corresponding to approximately 85× coverage of the genome (Table 1).
Overall, 90.3% of the Hi-C reads were aligned to the draft genome assembly, sorted, and merged.Then duplicates were removed using Juicer v1.6 (Juicer, RRID:SCR_017226) [45,46] with default parameters.Approximately 87% of the reads were found to have Hi-C contacts.
Afterward, the 3D-DNA pipeline was run with default parameters to generate a candidate assembly [47,48], which was reviewed using JBAT v2.10.01 [49].Finally, a high-quality chromosome-level genome assembly was generated after a visual review on JBAT [49].A total of 26 pseudo-chromosomes were anchored, corresponding to 97.6% of the estimated genome size, yielding a chromosome-level assembled reference genome with an N50 of 148.6 Mbp (Table 1, Fig. 1C) and a BUSCO completeness of 97.3% (Fig. 1D, Table 1).Genome assembly statistics were visualized with a snail plot in BlobToolKit v2.6.4 [50,51] (Fig. 1E).The chromosome level assembly of A. gigantea (AldGig_1.0)has the longest contig and scaffold N50, and one of the highest BUSCO completeness scores of all available chromosome-level assembled chelonian genomes (Table 2).

RNA extraction and sequencing
A whole blood sample of approximately 1 mL was collected from an individual named Grosser Bub ('Big Boy') during routine veterinary blood sampling in the Zurich Zoo.A total of 125 µl of whole blood was immediately diluted with the same amount of water, added into TRIzol™ LS Reagent (Invitrogen), and stored on ice for < 2 hours until extraction.RNA was extracted at the Genetic Diversity Center, ETH, following a combination of a TRIzol™ LS (Invitrogen) RNA isolation protocol and the RNeasy Mini Kit (Qiagen).First, the sample was incubated at room temperature for 5 min.Then, 0.2 mL chloroform was added to the sample and the mixture was inverted for 15 seconds, followed by 3 min incubation at room temperature.The resulting mixture was centrifuged at 11,000 rpm for 15 min at 4°C.After centrifugation, the upper phase containing the RNA was collected, mixed with 1x 70% ethanol, and transferred to an RNeasy spin column.For the remaining procedure, the protocol "Purification of Total RNA from Animal Tissues" of the kit was followed, starting from step 6.Briefly, the RNA was bound to the spin column, washed, and eluted in 30 μl molecular grade water.The initial quality control of the RNA was done on a TapeStation (Agilent) and the concentration was measured with a Qubit Fluorometer (Thermo).
The PacBio IsoSeq library for RNA sequencing was produced at the Functional Genomic Center Zurich using the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences), according to the manufacturer's instructions.A total of 300 ng RNA was used as input for the cDNA synthesis, which was carried out using the NEBNext® Single Cell/Low Input cDNA  1).

Gene prediction and annotation
Gene prediction was performed using a combination of ab initio and evidence-based prediction methods (RNA-seq and homology-based) with the braker2 pipeline v2.1.5(BRAKER, RRID:SCR_018964) [58][59][60][61][62].All gene predictions were performed with pre-trained parameter sets for chicken, which is the evolutionarily closest taxon for A. gigantea available within the software.Using pre-trained parameters yielded more complete annotations compared to training with extrinsic evidence (i.e.RNA-seq and protein data).The ab initio prediction was performed by utilizing the soft-masked reference genome.Evidence for the transcriptomebased prediction was based on combining information from A. gigantea PacBio Iso-seq and all available RNA-seq databases from chelonians in closely related genera (Chelonoidis spp.and Gopherus spp.; Table S3).For the alignment of short and long read transcripts, the splice-aware alignment tools STAR v2.7.9 (STAR, RRID:SCR_004463) [63,64] and minimap2 v2.24 (Minimap2, RRID:SCR_018550) [65,66] were used, respectively.Additionally, evidence for the homology-based prediction consisted of a protein database combining all vertebrate proteins in the OrthoDB v10 (OrthoDB, RRID:SCR_011980) [67] and the protein sequences of Gopherus evgoodei (NCBI RefSeq: GCF_007399415.2) and Chelonoidis niger abingdonii     1, Table S5).
We also performed a complementary collinearity analysis based on orthologous gene sets of A. gigantea and the phylogenetically closest available chromosome-level assembled Gopherus evgoodei (NCBI RefSeq: GCF_007399415.2) reference genomes (split time ca.50 my [93]).

Sample collection for low coverage whole-genome resequencing
The native distribution of A. gigantea is restricted to the Aldabra Atoll (Fig. 1B) with deep water channels separating the four main islands (Grande Terre, Malabar, Polymnie, and Picard; Fig. 3A).The smallest island Polymnie no longer harbors any tortoises [101].Tortoises were also harvested to extinction on Picard in the 1800s but the island has since been re-populated through translocations from Malabar and Grande Terre [102].In addition to the Atoll, there is an unknown, but large number of ex-situ individuals in captive, semi-natural, or rewilded populations [13].Assessments of the genetic health of native and rewilded populations will be crucial to inform future species management.However, the uncertainty about genomic vulnerabilities and which ex-situ individuals to use for re-wilding efforts constitute significant barriers.
To assess the utility of our reference genome resources to improve the genomic monitoring and inform re-wilding efforts, we performed low-coverage whole-genome sequencing of a representative sample of two main islands as well as zoo-housed individuals.Low-coverage sequencing is a powerful and cost-effective approach for conservation and population genomics [103], as well as ancient DNA analyses [104].We collected blood samples from a total of 30 adult A. gigantea (Table S6) from Malabar (East, N=10; West, N=5) and Grande Terre (East, N=8; South, N=4; West, N=3) (Fig. 3A).The collection yielded ~200 µl of blood from the cephalic vein of a front limb.We received a research permit from the Seychelles Bureau of Standards (ref #A0347) for our collection.European institutions currently host over 360 A. gigantea [105].Here, we analyzed two female individuals living in Zurich Zoo, Switzerland.The individual named Hermania was used to create the reference genome, and the individual named Maleika arrived at Zurich Zoo in 1984 and lived there until her death in 2018.
The historic information surrounding the exact importation location from Aldabra is sparse or unknown.Sampling from Hermania was performed as described above, and sampling from Maleika was performed by using ~500 mg of muscle tissue sampled after veterinary necropsy and stored in absolute ethanol until DNA extraction.

DNA extraction, and sequencing
DNA extraction was performed with 3 µl of blood from Hermania and 15 mg of muscle tissue from Maleika, using the sbeadex™ kit (LGC Genomics, Middlesex, UK), following the manufacturer's protocol for DNA extraction from nucleated red blood cells and tissue, respectively.Genomic DNA concentrations were measured with a dsDNA Broad Range Assay Kit (Qubit 2.0 Fluorometer, Invitrogen, Carlsbad).More than 200 ng of DNA per sample was sent to Novogene Company (Cambridge, United Kingdom) for library preparation and sequencing.Briefly, the genomic DNA was randomly fragmented to a size of 350 bp, endpolished, A-tailed and ligated with Illumina adapters of Illumina sequencing.After PCR enrichment, products were purified (AMPure XP system) and checked for quality on an Agilent 2100 Bioanalyzer (Agilent Technologies, CA, USA).Molarity was assessed using real-time PCR.Libraries were sequenced on the Illumina Novaseq 6000 platform with paired-end runs of 150 bp read length.For each of the 32 samples, ~2.6 Gbp raw reads were generated (Table S6, Table 1).

Data filtering, alignment, and genotype likelihood estimation
To account for the low coverage sequencing approach, we assessed genotype likelihoods using the Atlas Pipeline [106,107].We first used the GAIA workflow to remove Illumina adapters with TrimGalore v0.6.6 (Trim Galore, RRID:SCR_011847) [108] with default parameters.
Only reads longer than 30 bp were retained.Then, reads were aligned to the reference genome with BWA using BWA-MEM v0.7.17 (BWA, RRID:SCR_010910) [40] filtering for mapping quality scores >20.Alignments were processed with the RHEA workflow for indel realignment with GATK v3.8 (GATK, RRID:SCR_001876) [109].A target interval set was created with a representative set of 15 samples and each individual was realigned together with a representative set of individuals (guidance samples) to enable realignment of low coverage samples without jointly realigning all samples.The average read depth per sample was 1.62-2.06with a mean of 1.79 (Table S6).
Nucleotides with base qualities below 20 were discarded.Excessive SNPs around indels and excessive mismatches with the reference were corrected (C50, baq 1, [112].Sites with read coverage in fewer than 50% of the samples were excluded (minimum representation among samples > 50%, -minInd 16).SNPs with a genotype likelihood p-value < 0.001 were retained, producing a final set of 7,131,506 variant sites.

Population genetic structure and individual assignments
Our low coverage sequencing analyses focused on revealing within and among island genetic differentiation in the Aldabra Atoll, as well as assigning likely origins for zoo-housed individuals.We first assessed the global genetic structure of the samples using a principal component analysis with PCAngsd v09.85 [113].Based on a total of 6,651,907 variant sites with a minor allele frequency >0.05, individuals from Malabar and Grande Terre split into individual groups (Fig. 3B).Both zoo samples were grouped within the group of Grande Terre individuals revealing the most likely origin for these individuals captured in the 20th century.
The principal component analysis also reveals a finer scale east-west population structure within islands confirming recent results based on ddRAD sequencing [114].
We also assessed genetic structure using unsupervised Bayesian clustering with NGSAdmix

Conclusions
We assembled the first high-quality, chromosome-level annotated genome for the Aldabra giant tortoise, resulting in one of the best assembled chelonian genomes.Chromosomal collinearity analyses revealed a high degree of conservation even among distantly related tortoise species.We showed that the high-quality resources can be combined with low-Grüter and Weihong Qi from the Functional Genomics Center, ETH, Zurich for their help in getting sequencing services.

Tables
Table 1 Summary of the genomic data produced in this study Synthesis & Amplification Module (NEB) and Iso-Seq Express Oligo Kit (Pacific Biosciences) following instructions.To enrich for longer transcripts (>3 kb), 82 µl ProNex Beads were used for the clean-up of the amplified DNA, as outlined in the protocol.For all subsequent quality control steps, a Bioanalyzer 2100 12 Kb DNA Chip assay (Agilent) and a Qubit Fluorometer (Thermo) were used to assess the size and concentration of the library.The SMRT bell-Polymerase Complex was prepared using the Sequel Binding Kit 3.0 (Pacific Biosciences) and sequenced on a PacBio Sequel instrument using the Sequel Sequencing Kit 3.0 (Pacific Biosciences).In total, one Sequel™ SMRT® Cell 1M v3 (Pacific Biosciences) was run with one movie of 20 hours per cell producing ~1.1 Gbp of HiFi data (Table NCBI RefSeq: GCF_003597395.1).This dataset was aligned against the chromosome-level assembled reference genome via the ProtHint pipeline v2.6.0 (ProtHint, RRID:SCR_021167)[68,69]. RNA-seq and homology-based evidence were incorporated for the braker2 pipeline (BRAKER, RRID:SCR_018964) run in --etpmode[62,68,[70][71][72][73][74].All gene models derived from ab initio and evidence-based methods were integrated into a high confidence non-redundant gene set by using TSEBRA v1.0.3[75,76], with the "keep ab_initio" configuration set.The translated protein sequences from the predicted gene models were searched against protein profiles corresponding to major clades/families of transposon open reading frames by TransposonPSI v1.0[77].Overall, 331 genes were identified as likely derived from transposable elements and excluded from the annotation.The resulting gene model set consisted of 23,953 protein-coding genes with a mean gene length of 39,458 bp and an average of nine exons per coding sequence (

(
NGSadmix, RRID:SCR_003208)[115].We performed pairwise linkage disequilibria (LD) pruning to reduce dependence among SNP loci[115].Pairwise LD was calculated using ngsLD[116] and LD pruning was performed by allowing a maximum among-SNP distance of 100 Kbp and a minimum weight of 0.5.After LD pruning, 5,862,629 SNPs were retained and 50 replicate runs of NGSAdmix (NGSadmix, RRID:SCR_003208)[115] were performed.We varied the number of clusters (k) between 2-5 and visualized the assignments with PopHelper v.1.0.10[117,118] (Fig. 3C).The admixture analyses for k=2 clusters revealed a main split with groups formed by East Grande Terre together with East Malabar opposed to West Malabar.South & West Grande Terre individuals were assigned to both groups.At k=4 each major sampling region was assigned to a single cluster.The zoo individual Maleika showed a genotype highly consistent with South & West Grande Terre individuals.The individual Hermania was assigned to different Grande Terre regions.

Fig. 1 A
Fig. 1 A) A female Aldabrachelys gigantea resting at La Vanille Nature Park, Mauritius.B) World map showing the location of Aldabra Atoll.C) Hi-C contact map of the chromosomelevel assembled Aldabrachelys gigantea Blue boxes represent assembled pseudo-chromosomes and green boxes represent assembled scaffolds that constitute pseudochromosomes.D) BUSCO completeness scores for the sauropsida dataset and E) Assembly metrics (including length of the longest scaffold N50 and N90) and sequence composition (GC content) of the chromosome-level Aldabrachelys gigantea genome.

Fig. 3 A
Fig. 3 A) The location of the Aldabra Atoll in the Western Indian Ocean and sampling location of 30 individuals within Aldabra Atoll.Every colored mark on the map represents a sampled tortoise.B) Principal component analysis plot of 30 Aldabrachelys gigantea individuals from the islands of Grande Terre (East, light blue circles; South & West, green circles) and Malabar (East, dark blue triangles; West, purple triangles) and two individuals from the Zurich Zoo (Hermania, black diamond; Maleika, light pink square).Principal components 1 and 2 account for 14.2% and 3.64% of the overall genetic variability, respectively.C) Admixture proportions of all the individuals for ancestral populations (k) varied from 2 to 5. Each bar represents one individual and shows its admixture proportions.

Table
[43,44]ompleteness of the annotation was assessed based on single-copy orthologs via BUSCO v5.1.2(BUSCO,RRID:SCR_015008)[43,44]withdefault parameters in the protein mode.The summarizing the properties of the structural annotation and for combining the structural and functional annotation results.Of all the prediction gene models, 94.1% could be functionally annotated (Table

Table 2
Contiguity and completeness statistics of all available chromosome-level assembled chelonian genomes BUSCO score generated from the sauropsid (sauropsida_odb10) database.BUSCO statistics C, complete; S, single-copy; D, duplicated; F, fragmented; M, missing *