The genome of Przewalski’s horse (Equus ferus przewalskii)

Abstract The Przewalski’s horse (Equus ferus przewalskii) is an endangered equid native to the steppes of central Asia. After becoming extinct in the wild multiple conservation efforts convened to preserve the species, including captive breeding programs, reintroduction and monitoring systems, protected lands, and cloning. Availability of a highly contiguous reference genome is essential to support these continued efforts. We used Oxford Nanopore sequencing to produce a scaffold-level 2.5 Gb nuclear assembly and 16,002 bp mitogenome from a captive Przewalski’s mare. All assembly drafts were generated from 111 Gb of sequence from a single PromethION R10.4.1 flow cell. The mitogenome contained 37 genes in the standard mammalian configuration and was 99.63% identical to the domestic horse (Equus caballus). The nuclear assembly, EquPr2, contained 2,146 scaffolds with an N50 of 85.1 Mb, 43X mean depth, and BUSCO quality score of 98.92%. EquPr2 successfully improves upon the existing Przewalski’s horse reference genome (Burgud), with 25-fold fewer scaffolds, a 166-fold larger N50, and phased pseudohaplotypes. Modified basecalls revealed 79.5% DNA methylation and 2.1% hydroxymethylation globally. Allele-specific methylation analysis between pseudohaplotypes revealed 226 differentially methylated regions in known imprinted genes and loci not previously reported as imprinted. The heterozygosity rate of 0.165% matches previous estimates for the species and compares favorably to other endangered animals. This improved Przewalski’s horse assembly will serve as a valuable resource for conservation efforts and comparative genomics investigations.


Introduction
The Przewalski's horse (Equus ferus przewalskii), also called the tahki, is an endangered equid native to central Asia (King et al. 2014) whose lineage likely diverged from the domestic horse (Equus caballus) tens of thousands of years ago (King 2005;Der Sarkissian et al. 2015;MacHugh et al. 2017;Gaunitz et al. 2018).Equus ferus przewalskii has a distinct short and stocky build with dun coloring and an erect mane.These horses were initially native to the steppes of central Asia, and by the 19th century inhabited only Mongolia, Tibet, and China (Bouman 1986;Wakefield et al. 2002).Introgression from the domestic horse, loss of natural habitat and resources, harsh climates, and hunting contributed to a severe population bottleneck and subsequent extinction in the wild in the 1960s (Food, of the United Nations AO 1986;Wakefield et al. 2002).All current Przewalski's horses are the descendants of 12 wild-caught individuals and several domesticated horses (Bouman 1986;Wakefield et al. 2002;Turghan et al. 2022).
Massive conservation efforts have focused on preserving the genetic diversity of this endangered species (Ryder and Wedemeyer 1982;Ryder 1988;Bowling et al. 2003;King and Gurnell 2005;Goto et al. 2011;Orlando 2019).Targeted captive breeding and management programs, including published studbooks since 1959, have increased the Przewalski's horse population to over 2,000 individuals (Ryder 1988;Der Sarkissian et al. 2015;Turghan et al. 2022).Reintroduction efforts starting in the 1980s have successfully established wild herds in protected lands of Mongolia, China, and Kazakhstan (King and Gurnell 2005;Robert et al. 2005;Kaczensky et al. 2007;Xia et al. 2014;Jiang and Zong 2019;Bernátková et al. 2022;Tang et al. 2022;Turghan et al. 2022), and have improved the species' status from critically endangered to endangered (King et al. 2014).Recent cloning efforts through Przewalski's Revive & Restore Project have produced two males from a cryopreserved cell line in the San Diego Zoo Wildlife Alliance Frozen Zoo (Alliance 2022a(Alliance , 2022b;;Restore 2024), further bolstering captive breeding efforts.
Investigation of the differences between domestic and Przewalski's horse chromosome structures may facilitate improved understanding of chromosome fusion.
Recent technological advances have increased accessibility and reduced costs for generating high-quality whole-genome sequencing data (Wang et al. 2021).Long-read Oxford Nanopore sequencing provides genomic and epigenomic data simultaneously without amplification or bisulfite conversion (Arakawa 2023).Two major limitations of the technology, high per-base error rates and homopolymer indels, have been addressed with improved basecalling models, self-correction algorithms, and adequate depth of coverage (Delahaye and Nicolas 2021).In addition to direct capture of epigenetic base modifications, pseudohaplotype phasing permits heterozygosity estimation and allele-specific DNA methylation analysis; strict allele-specific methylation is a signature of genomic imprinting, where genes are monoallelically expressed based on parent of origin (Bajrami and Spiroski 2016;Elhamamsy 2017).The epigenome is also relevant to species evolutionary biology due to the increased mutation rate of methylated cytosines via deamination (Jaenisch and Bird 2003;Xia et al. 2012;Skvortsova et al. 2018;Yamaguchi 2019;Chang and Lu 2020;Xu et al. 2022).
Here, we highlight the exclusive use of Oxford Nanopore sequencing reads to provide a high-quality, highly contiguous diploid nuclear genome assembly, updated mitogenome, and DNA methylation analysis for the endangered Przewalski's horse.

Sample collection
Varuschka, a 10-year-old captive-bred Przewalski's mare residing at the Minnesota Zoo (Apple Valley, MN, USA; Fig. 1), was chosen for sampling.The mare was subject to routine veterinary care under anesthesia, during which 10 ml of whole blood was collected by zoo veterinarians.The Minnesota Zoo has been active in Przewalski's horse breeding and management, with over 50 foals born since the 1970s, and contributed a stallion to reintroduction efforts in Mongolia's Hustai National Park (Lessard 2008;Mallinger et al. 2023).

DNA extraction and sequencing
Genomic DNA was extracted from blood using a MagAttract Blood DNA/RNA Kit (Qiagen, Venlo, Netherlands) according to manufacturer's instructions.Sequencing was performed on a P2 Solo instrument (Oxford Nanopore Technologies, Oxford, UK) using a single PromethION R10.4.1 flow cell.We created two libraries using Ligation Sequencing Kit V14 (ONT catalog number SEQ-LSK-114) for native DNA sequencing.For the first library, 3 μ g of DNA in 100 μl elution buffer was sheared by passage through a 28 gauge needle 30 times, library prepped, then split into three aliquots of 15 μl each.For the second library, unsheared DNA was eluted and prepped as a single 15 μl library.The first aliquot of the sheared library was loaded onto the flowcell and run for 24 hours, after which the flowcell was washed using the manufacturer's wash kit.The remaining two sheared aliquots and the unsheared library were loaded and sequenced for the same duration on days 2, 3, and 4, respectively, with flowcell washes performed before each reload.Data were collected using 5 kHz MinKNOW version 23.07.12 (Oxford Nanopore Technologies, Oxford, UK).
Scaffolds for E. ferus przewalskii were assigned chromosome names based on homology to E. caballus synteny from EquCab3.0 as demonstrated previously (Ahrens and Stranzinger 2005).The homologous E. f. przewalskii scaffold to E. caballus chromosome 5 was split on its scaffold gaps with AWK, and the largest blocks syntenic to the p-and q-arms were labeled with their chromosome names (Supplementary File S1).The remaining smaller contigs syntenic to EquCab3.0 chromosome 5 were labeled as ChrUns in the scaffolded assembly.The FCS-adapter tool from the NCBI Foreign Contamination Screening program suite was used to detect and remove adapter and vector contamination from the final haplotype assemblies (https://github.com/ncbi/fcs).The quality of all draft assemblies was evaluated by detecting Benchmarking Universal Single-Copy Orthologs (BUSCOs) within the Cetartiodactyla lineage (Manni et al. 2021).The reference dataset used for searching, "cetartiodactyla_odb10", contained 13,335 genes shared by 22 species.We used the Compleasm package to calculate BUSCO scores as it is a faster, more accurate implementation of the package (Huang and Li 2023).We combined Compleasm's single and duplicate BUSCO counts to provide a direct comparison to the standard BUSCO program's 'complete' value.On average, Compleasm detected 3% more BUSCOs than the BUSCO tool v3.4.7 (Supplementary File S1).We calculated assembly N50, L50, and other statistics using the packages Assembly-Stats (https://github.com/sanger-pathogens/assembly-stats)and Quality Assessment Tool for Genome Assemblies (QUAST v5.2.0) (Mikheenko et al. 2018), with EquCab3.0 serving as the reference for the latter.Assembly statistics were compared with other publicly available equid genomes including the preexisting Przewalski's horse (Burgud), domestic horse (EquCab3.0),and plains zebra (Equus quagga; UCLA_HA_Equagga_1.0) references.

Variant calling and diploidization
Clair3 v1.05 (Zheng et al. 2022) was run to determine the number and type of variants.Heterozygosity was calculated by counting the number of variants divided by the total genome size.Variants were phased with Whatshap (Martin et al. 2016) and haplotagged with Longphase (Lin et al. 2022).BCFTools was used to swap out the phased variants in the primary assembly to generate the secondary pseudohaplotype assembly (https://github.com/samtools/bcftools).
For allele-specific DNA methylation, Modkit was applied to the phased modBAM generated by variant calling to generate 2 5 mC bedMethyl files separated by haplotag.Count filtering, normalization, tiling into 100 bp windows, and differential methylation testing were performed with MethylKit (Akalin et al. 2012;Wreczycka et al. 2017).Windows were deemed significantly differentially methylated regions (DMRs) if they had at least 10 CpGs, a mean absolute difference in methylation > = 50%, and a Benjamini-Hochberg-adjusted P-value <0.05.DMRs were annotated by identifying the nearest gene with BedTools (Quinlan and Hall 2010).DMRs identified between haplotypes were visualized with MethylArtist (Cheetham et al. 2022).

Assembly
We generated a total of 111 Gb of DNA sequencing data for Przewalski's horse with a read N50 of 10,829 bp and mean quality of Q18.49.The quality of each draft assembly was assessed with parameters including N50 (i.e.length of the shortest contig at 50% of the total assembly length), L50 (i.e.smallest number of contigs whose length sum to 50% of the total assembly length), and the count of benchmark universal single-copy ortholog (BUSCO) genes.Our initial Flye run yielded a 2.59 Gb draft assembly with 6,808 contigs and an N50 of 13.6 Mb (Table 1).With an L50 of 55, the majority of the genome was assembled into relatively few contigs representing large portions of chromosomes.
The initial assembly was polished, purged of duplicates, and manually curated to remove contigs with coverage below 15× and 500× as they are unlikely to represent single-copy nuclear regions (Flack et al. 2023).This procedure resulted in the removal of 4,012 contigs and 96 Mb of sequence.Polishing with Medaka increased the complete BUSCO score by reducing the percentage of duplicate BUSCOs by 0.01%.In our initial draft, the duplicate BUSCO count was already low at less than 1% of the total; haplotig purging reduced the number of duplicate BUSCOs by 0.11% and increased N50 from 13.70 Mb to 13.97 Mb.
We scaffolded the curated assembly onto the domestic horse genome (EquCab3.0);gaps were filled with reads placed by TGS-GapCloser where spanning reads could be identified.This final assembly, EquPr2, was 2.50 Gb in length with a chromosomelevel N50 of 85.1 Mb and L50 of 12. EquPr2 contained 306 gaps spanning 2.03 Mb, a 5-fold and 11-fold reduction in gap length compared with EquCab3.0 and Burgud, respectively (Table 1).
Our Przewalski's horse genome had a BUSCO completeness score of 98.92%, improving from 89.88% in the previous Przewalski's horse assembly with 25-fold fewer scaffolds and a 166-fold increase in N50.It is important to deconstruct the complete BUSCO score reported by most new genome assemblies into its component parts of single and duplicate copy percentages.Examining only the completeness score obscures the presence of haplotig missassemblies represented by high duplicate counts.For instance, the reference horse genome, EquCab3.0, is a highquality, highly contiguous assembly with a complete BUSCO score of 98.92%, the same as our Przewalski's horse assembly EquPr2.However, EquCab3.0 has a duplicate rate of 1.18% versus EquPr2's duplicate rate of 0.55%.Similarly, the plains zebra genome, UCLA_HA_Equagga1.0, is even more contiguous with an L50 of 9 but has a 3.34% BUSCO duplication rate.While these differences are relatively small, duplication misassemblies have contributed to erroneous gene gain and gene family expansion findings in high-heterozygosity vertebrate genomes (Ko et al. 2022).
The Przewalski's horse has 33 sets of chromosomes versus the domestic horse's 32 due to a Robertsonian translocation where E. caballus chromosome 5 is homologous to E. f. przewalskii chromosomes 23 and 24 (Ahrens and Stranzinger 2005).Other EquPr2 chromosomes were named based on homology to domestic horse consistent with previous karyotyping (Richer et al. 2008;Musilova et al. 2007).We manually split the EquPr2 scaffold homologous to E. caballus chromosome 5 at every gap from the contig-level assembly and named the largest contigs as chromosomes 23 and 24 based on homology to the p and q arms of its species ortholog.
Due to lack of positional certainty, the remaining 12 contigs (34.4 Mb total length) mapping to EquCab3.0's chromosome 5 were named as chromosome unknown (ChrUn).

Repetitive DNA
We used RepeatMasker to detect repetitive element content within the new assembly and compared it to the previous assembly, Burgud, and the domestic horse (Table 2).Given the genetic similarity of Przewalski's horse to the domestic horse, global transposon content is nearly identical across all categories.Identification of single repeat insertions facilitated by a more contiguous Przewalski's horse reference may be valuable for future comparative genomics investigations of these two closely related species.

Heterozygosity
Given the extreme population bottleneck that occurred during the near-extinction of Przewalski's horse, it is critical to understand the genetic diversity remaining for captive breeding efforts.To this end and due to a lack of samples from Varuschka's sire and dam, we called and phased EquPr2 variants to build an alternate pseudohaplotype assembly.Variant calling with Clair3 found 4,114,297 variants, the majority of which were single-nucleotide variants.Heterozygosity was estimated to be 0.165.This level of genetic diversity is concordant with previous microarray data from 9 Przewalski's horses where average heterozygosity was 0.168, the highest estimate among nine members of Equidae excluding the domestic horse (McCue et al. 2012).

Gene annotation
We detected 21,552 putative genes in EquPr2 with the ab initio gene prediction tool Gene Model Mapper (GeMoMa) (Keilwagen et al. 2016(Keilwagen et al. , 2018) ) using the EquCab3.0 annotation (GCF_002863925.1)as reference.The resulting protein BUSCO score was 85.6% complete (84.6% single copy and 1.0% duplicates); this score appears to be near GeMoMa's maximum performance based on previous demonstration of similar results for existing high-quality reference genomes (Flack et al. 2023).The EquPr2 annotation would likely be significantly improved by the application of NCBI's Eukaryotic Genome Annotation Pipeline (Thibaud-Nissen et al. 2016), which requires the availability of RNA-seq data.

Mitochondrial genome
The mitochondrial genome was extracted from the assembly and characterized with MitoHiFi, a tool designed for long-read mitochondrial contig identification and annotation (Uliano-Silva et al. 2022).The resulting mitogenome for EquPr2 (CM075423.1)was 16,602 bp in length, had 99.63% sequence identity to E. caballus isolate TN9488, and contained 37 genes in the standard mammalian   S1).The repeat copy number was 28 in EquCab3.0,22 in EquPr2, and 20 in the reference Przewalski's horse, Burgud.The EquPr2 mitogenome being 10 bp shorter than Burgud's stemmed from short insertions and deletions (1-6 bp) throughout the sequence in addition to the smaller difference in tandem repeat copy number.Delineating misassemblies from true variation in this highly variable region of the equine mitogenome will be valuable for future comparative analyzes.

DNA methylation
Oxford Nanopore sequencing can natively detect base modifications including 5-methylcytosine (5 mC).This feature has been applied in previous mammalian genome assemblies to evaluate allele-specific DNA methylation and identify known and putative novel imprinted genes (Flack et al. 2023).Globally, we found that whole blood leukocyte DNA methylation in Przewalski's horse was 79.5% methylated and 2.1% hydroxymethylated at CpG sites genome-wide.After filtering to include CpGs with at least 10× coverage in both pseudohaplotypes, there were 18,928,678 sites available to test for allele-specific differential methylation.
Counts were tiled into 100 bp windows; windows containing at least 10 CpGs with an absolute difference in methylation > = 50% and Benjamini-Hochberg-adjusted P-value <0.05 were deemed DMRs.With these parameters, we identified 226 DMRs between EquPr2 pseudohaplotyes with a mean absolute methylation difference of 64.1% (Fig. 3).Nearest features to DMRs included known imprinted genes (e.g.IGF2R Barlow et al. 1991;Willison 1991, INPP5F Choi et al. 2005, PEG Kuroiwa et al. 1996, and DIRAS3 Yu et al. 1999) and loci not previously reported as imprinted (e.g.DLG3).Seventy-seven DMRs (34.1%) directly overlapped predicted genes.As a consequence of random X-chromosome inactivation in a female animal, 111 of the 226 DMRs (49.1%) were located on ChrX.Imprinted genes are unique to therian mammals (i.e.among animals), influence growth, and can be perturbed by assisted reproductive technology (Hitchins and Moore 2003;Wood and Oakey 2006;Manipalviratn et al. 2009;Peters 2014;Karahan et al. 2023), making these data valuable for future investigations of Przewalski's horse evolution, comparative genomics, and conservation.

Table 1 .
Contiguity and quality statistics for draft E. ferus przewalskii assemblies, final EquPr2 assemblies, and existing equid reference genomes.