Draft genome of the lowland anoa (Bubalus depressicornis) and comparison with buffalo genome assemblies (Bovidae, Bubalina)

Abstract Genomic data for wild species of the genus Bubalus (Asian buffaloes) are still lacking while several whole genomes are currently available for domestic water buffaloes. To address this, we sequenced the genome of a wild endangered dwarf buffalo, the lowland anoa (Bubalus depressicornis), produced a draft genome assembly and made comparison to published buffalo genomes. The lowland anoa genome assembly was 2.56 Gbp long and contained 103,135 contigs, the longest contig being 337.39 kbp long. N50 and L50 values were 38.73 and 19.83 kbp, respectively, mean coverage was 44× and GC content was 41.74%. Two strategies were adopted to evaluate genome completeness: (1) determination of genomic features with de novo and homology-based predictions using annotations of chromosome-level genome assembly of the river buffalo and (2) employment of benchmarking against universal single-copy orthologs (BUSCO). Homology-based predictions identified 94.51% complete and 3.65% partial genomic features. De novo gene predictions identified 32,393 genes, representing 97.14% of the reference’s annotated genes, whilst BUSCO search against the mammalian orthologs database identified 71.1% complete, 11.7% fragmented, and 17.2% missing orthologs, indicating a good level of completeness for downstream analyses. Repeat analyses indicated that the lowland anoa genome contains 42.12% of repetitive regions. The genome assembly of the lowland anoa is expected to contribute to comparative genome analyses among bovid species.


Introduction
The lowland anoa, Bubalus depressicornis (Smith 1827), is a wild dwarf buffalo endemic to Sulawesi and Buton Islands, where it can be found in sympatry with the mountain anoa, Bubalus quarlesi (Ouwens 1910). Both anoa species are currently classified as endangered with declining populations due to hunting and habitat loss (Burton et al. 2016). Because of their singular appearance, they were initially described in their own genus Anoa (Ouwens 1910). However, Anoa was not regarded as a valid genus in more recent classifications, in which both anoa species were ascribed to the genus Bubalus, together with the wild water buffalo-Bubalus arnee (Kerr 1792) and the tamaraw-Bubalus mindorensis (Heude 1888;Groves 1969;IUCN 2022). Molecular studies based on mitochondrial sequences have supported a sister-group relationship between B. depressicornis and B. quarlesi (Schreiber et al. 1999;Priyono et al. 2020). In addition, the mitogenome of the lowland anoa was found to be equally distant from those of the 2 types of domestic water buffalo, the river buffalo from the Indian subcontinent and Mediterranean countries and the swamp buffalo from China and Southeast Asia (Hassanin et al. 2012). Since the same phylogenetic pattern was recovered from the analyses of 2 nuclear datasets, one based on 30 autosomal genes and the other based on 2 genes of the Y chromosome, Curaudeau et al. (2021) have concluded the existence of 2 species of domestic buffaloes: Bubalus bubalis (Linnaeus 1758) for the river buffalo and Bubalus kerabau (Fitzinger 1860) for the swamp buffalo, which diverged during the Pleistocene at around 0.84 Mya. As discussed in Curaudeau et al. (2021), the 2 domestic species can easily be distinguished based on coat and horn characteristics (Castelló 2016), and they have different karyotypes: B. bubalis has 2n ¼ 50 chromosomes with a fundamental number (FN) equal to 58; whereas B. kerabau has 2n ¼ 48 chromosomes and FN ¼ 56 (Nguyen et al. 2008).
With rapid progress and cost reduction in sequencing technologies, many whole genomes of domestic bovid species have been sequenced. Whole-genome sequencing has allowed the identification of variants involved in domestication and genetic improvement for several livestock species such as cattle and buffaloes (Zimin et al. 2009;Canavez et al. 2012;Li et al. 2020;Rosen et al. 2020). Chromosome-level genome assemblies include those of the domestic cow, Bos taurus (Zimin et al. 2009), the domestic river buffalo, B. bubalis (Deng et al. 2016), the swamp buffalo, B. kerabau [reported as Bubalus carabanensis in Luo et al. (2020) but see Curaudeau et al. (2021) for further taxonomic information], the domestic Yak, Bos grunniens (Zhang et al. 2021) and the zebu cattle, Bos indicus (Canavez et al. 2012). Whereas a total of 8 chromosome-and scaffold-level genome assemblies are publicly available for domestic buffaloes, there are currently no genome data available for wild species of the genus Bubalus. To fill this gap, a biopsy of a living lowland anoa was used for nextgeneration sequencing, and a draft genome was assembled de novo for comparison to other buffalo genome assemblies available in international databases such as NCBI (National Center for Biotechnology Information) and BIG_GWH (Beijing Institute of Genomics Genome Warehouse database).

Materials and methods
DNA extraction, library preparation, and genome sequencing A living male adult of lowland anoa, named Yannick, was sampled at the Ménagerie du Jardin des Plantes of the Musé um national d'Histoire naturelle (MNHN, Paris, France; Fig. 1). A skin biopsy was performed in 2006 by a veterinary surgeon following protocols approved by the MNHN and in line with ethical guidelines. The same biopsy was previously used to determine its karyotype (2n ¼ 48; FN ¼ 58; Nguyen et al. 2008). DNA was extracted using the DNeasy Blood and Tissue Kit (Qiagen, Hilden, Germany) following the manufacturer's protocol. DNA quantification was performed with a Qubit 2.0 Fluorometer with Qubit dsDNA HS Assay Kit (Thermo Fischer Scientific, Walthan, MA, USA). Library preparation and sequencing were conducted at the Institut du Cerveau et de la Moelle épinière. The sample was sequenced on a NextSeq 500 Illumina system generating 2 Â 151 bp reads using the NextSeq 500 High Output Kit v2 with 300 cycles and aiming for an insert size of 350 bp.

Assembly quality assessment
Genome assemblies publicly available for Bubalus and Syncerus genera were retrieved from NCBI and BIG_GWH for quality comparison and assessment. The dataset included 2 assemblies at the chromosome level for the river buffalo (B. bubalis) with a coverage of 100Â and 572Â, 4 scaffold-level draft assemblies of river buffalo with coverage ranging between 69Â and 119Â, one chromosome-level assembly of swamp buffalo (B. kerabau) with a mean coverage of 65Â, and one scaffold-level draft assembly of the African buffalo (Syncerus caffer) with 162Â coverage. The 8 retrieved assemblies were sequenced and assembled with different methods, summarized in Table 1.
The quality of the lowland anoa genome assembly was assessed with QUAST-LG v.5.0.1 (Mikheenko et al. 2018) using the river buffalo NDDB_SH_1 genome assembly (Deng et al. 2016) as a reference. The default parameters for mammalian genomes were used to compare all assemblies in QUAST-LG: "MODE: large, threads: 50, eukaryotic: true, minimum contig length: 3,000, minimum alignment length: 500, ambiguity: 1, threshold for extensive misassembly size: 7,000." All analyzed assemblies were aligned to the river buffalo NDDB_SH_1 assembly and results were plotted with Circos v. 0.69.8 (Krzywinski et al. 2009) and Jupiter consistency plots (Chu 2018). We adopted 2 different strategies to evaluate genome completeness. Firstly, genomic features were predicted with the homology-based method by aligning the lowland anoa genome to that of the annotated river buffalo reference genome (NDDB_SH_1 and relative annotations retrieved from NCBI). Secondly, we used a de novo gene prediction method with GlimmerHMM v3.0.4 (Majoros et al. 2004). Thirdly, we employed benchmarking against universal single-copy orthologs (BUSCO v5.2.2; Manni et al. 2021) using the mammalia_odb10 dataset (2021 February 19, number of genomes: 24, number of BUSCOs: 9,226) from OrthoDB (Kriventseva et al. 2019) and compared to other buffalo genome assemblies already deposited on NCBI and BIG_GWH (Table 1).

Repeats and gene annotation
Repetitive regions in the lowland anoa genome were identified, annotated, and masked with RepeatMasker v.4.1.2-p1  (Tarailo-Graovac and Chen 2009). Firstly, a de novo repeat library was constructed from the genome assembly with RepeatModeler v.2.0.2a. RepeatMasker was used with default parameters to produce a homolog-based repeat library and mask the genome's repetitive regions. The scripts "calcDivergenceFromAlign.pl" and "createRepeatLandscape.pl" were used to calculate the Kimura divergence values and to plot the resulting repeat landscape. The repeat landscape of B. taurus was retrieved from the RepeatMasker database for visual comparison.

Results and discussion
Whole-genome sequencing and data QC Whole-genome sequencing generated 991,437,058 paired-end reads with a length of 151 bp. Quality trimming removed 46,616,722 low-quality, adapter-contaminated, and PCRduplicated reads, representing approximately 0.5% of the total reads. A total of 944,820,336 clean paired-end reads were generated, covering the lowland anoa genome with an estimated 56Â depth based on a genome size of 2.56 Gbp. The estimation of insert size using in-house script returned a mean of 377 and a standard deviation of 83.
De novo assembly quality metrics The final lowland anoa genome assembly generated here contained 103,135 contigs, the largest being 337.39 kbp long, an N50 of 38.73 kbp and an L50 of 19.83 kbp ( Table 2). The total length was 2.56 Gbp with a mean coverage of 44Â, and GC content was 41.74%, in agreement with other published assemblies (between 41.60% and 41.92%, Table 3). When aligned to the NDDB_SH_1 genome assembly, the fraction of the anoa genome assembly was 95.41%, a value comparable to other buffalo genome assemblies (Fig. 2), with a total alignment length of 2,515,453,843 bp. A total of 886 contigs could not be aligned to the river buffalo genome assembly, whilst 8,085 contigs were only partially aligned, resulting in a total unaligned length of 45,224,171 bp, which reflects the discrepancy between the total length of the lowland anoa genome and the total aligned length to the reference river buffalo genome assembly. Partially aligned and unaligned contigs could have resulted from structural variations between the lowland anoa and the reference river buffalo assembly, such as large INDELS (insertion/deletions), as well as repetitive regions and/or alternative haplotypes causing assembly errors. The nature of short-read technology causes difficulties in characterizing genomic regions such as telomeres, centromeres, repetitive, and highly heterochromatic regions (Johnson et al. 2005;Low et al. 2019;Weissensteiner and Suh 2019), which are notoriously difficult to assemble and could be better resolved with long-read sequencing. The lowland anoa genome assembly has a modest N50 compared to other buffalo genome assemblies (Table 3), indicating lower levels of contiguity, which is expected due to the shortread output of Illumina sequencing technology (read length ¼ 151 bp). In addition, repeat analysis revealed that 42.12% of the lowland anoa genome is composed of repetitive regions. This, coupled with low-sequence coverage, sequencing and assembly errors, causes breaks in the assembly contiguity (Gnerre et al. 2011;Low et al. 2019). This is apparent even in high-quality chromosome-level genome assemblies that use multiple sequencing libraries and multiple sequencing technologies, such as the previous human genome assembly GRCh38, which contained hundreds of gaps (International Human Genome Sequencing Consortium 2004). In addition, the chromosome-level genome assemblies retrieved from NCBI (NDDB_SH_1, UOA_WB_1) were sequenced using multiple insert size libraries and sequencing technologies and were intensively verified with multiple methods such as optical mapping, Hi-C, and RH (Deng et al. 2016;Low et al. 2019).
Moreover, quality metrics of publicly available assemblies are usually limited to reporting N50 and L50 values, which represent the shortest contig length needed to cover 50% of the total assembly size, and the number of contigs whose cumulative length covers 50% of the total assembly size, respectively (Bradnam et al. 2013). Such metrics are often used to compare and evaluate performances of the ever-growing assembly and annotation methods and software (Manchanda et al. 2020). However, we hereby show that reporting N50 and L50 metrics exclusively can be misleading, as they only provide a standard measure of assembly contiguity whilst omitting information such as gene content and completeness, as well as assembly correctness. Furthermore, N50 values can be artificially raised by deliberately excluding short contigs from analyses and by the presence of undetermined nucleotides (Ns) linking the scaffolded contigs (Gurevich et al. 2013). Therefore, to assess the quality of the lowland anoa genome assembly, we generated conventional N50 and L50 metrics and also determined genome completeness in terms of gene content and genome correctness by comparing our assembly to a chromosome-level genome assembly of the river buffalo (B. bubalis). In addition, a swamp buffalo (B. kerabau, CUSA_SWP) and a more distantly related African buffalo species (S. caffer, ABF221) were also included in our comparison.
Regardless of the modest N50 value, the lowland anoa genome assembly is in good agreement with the NDDB_SH_1 assembly, with 95.91% of contigs correctly mapped to the 25 reference chromosomes of the river buffalo and fewer misassembled blocks compared to other draft assemblies (Fig. 3). The genome assembly of the Egyptian river buffalo (EGYBUF_1.0) had an abnormally high number of misassembled blocks with respect to the reference genome, followed by the genome assembly of a female Italian river buffalo (UOA_WB_1). To investigate this, misassemblies and structural variation metrics were computed in QUAST-LG (Table 4). The Egyptian river buffalo assembly (EGYBUF_1.0) showed the highest number of mismatches and the highest number of Ns, followed by the Jaffrabadi river buffalo (AAUIN_1). The genome assembly of the African buffalo (S. caffer, ABF221) showed a larger number of mismatches (Table 4), but this can be explained by the higher sequence divergence between Syncerus and Bubalus, as the 2 genera have separated in the Late Miocene (Hassanin et al. 2012). Misassemblies and structural variation metrics could not explain the misassembled blocks of the UOA_WB_1 assembly observed in the Circos plot of Fig. 3. However, some of these misassembled blocks could be due to unplaced contigs. To investigate this, the UOA_WB_1 assembly was aligned to the NDDB_SH_1 reference to generate Jupiter consistency plots. When using the largest 26 contigs of the UOA_WB_1 assembly to cover 100% of the reference river buffalo genome, an almost perfect level of synteny was observed (Fig. 4a). Although this result was expected for genomes of the same species, it also indicates a good level of assembly quality in terms of correctness. However, when including all 509 contigs of the UOA_WB_1 assembly, several misassembled regions were observed (Fig. 4b).
Three nonexclusive hypotheses can be advanced to interpret this result: possible genomic rearrangements, genome assembly errors, and repetitive regions. Whether the results of the consistency plots are due to the factors mentioned above or other factors, such as contamination, remains speculative. Nevertheless, the results of the quality metric comparison conducted here further indicate the unreliability of using exclusively N50 and L50 metrics when assessing assembly quality. Instead, contiguity metrics should be supplemented with genome completeness and correctness metrics.

Genomic features, gene prediction, and annotation
Homology and de novo gene predictions performed on the lowland anoa genome assembly were in agreement with each other and indicated a good level of genome completeness. Results were comparable to other published genome assemblies (Tables 5  and 6), and an improvement over the Bangladeshi river buffalo (Bubbub_1.0), the Egyptian river buffalo (EGYBUF_1.0), and Mediterranean river buffalo (UMD_CASPUR_WB_2.0) assemblies. Interestingly, these 3 assemblies showed higher contiguity (N50) than the draft assembly of the lowland anoa, further indicating the unreliability of using exclusively N50 and L50 metrics when assessing genome assembly quality.
Out of the 1,921,249 genomic features annotations of the reference assembly NDDB_SH_1, homology prediction identified 1,815,794 (94.51%) complete and 69,929 (3.63%) partial features in the lowland anoa genome assembly, which is comparable to other published assemblies (Fig. 5), indicating a good level of genome completeness. GlimmerHMM de novo predicted 1,027,469 unique genomic features (mRNA and coding sequences, CDS), which is an improvement over some of the water buffalo assemblies used for quality comparison (Table 5). Homology-based gene prediction identified 32,393 genes in the lowland anoa genome assembly, representing 97.14% of the genes annotated in NDDB_SH_1 (n ¼ 33,348). Of these, 59.11% (19,148) were complete and 40.88% (13,245) were partial, probably reflecting the level of fragmentation  2. Cumulative length of aligned contigs of the lowland anoa (red line) against the river buffalo NDDB_SH_1 reference genome assembly (dashed line) and compared to other buffalo genome assemblies available on NCBI. Fig. 3. Circos plot of scaffolds mapped to NDD_SH_1 reference genome assembly (Bubalus bubalis). Outer circle represents reference sequence with GC% heatmap (0% ¼ white, 69% ¼ black). Inner circles represent assembly tracks, with heatmap representing correct contigs (green) and misassembled blocks (red).  of the lowland anoa genome assembly. Nevertheless, the total number of genes predicted still represents an improvement over some of the compared assemblies (Table 6). When predicting mammalian orthologs with BUSCO, the lowland anoa genome assembly contained 6,556 (71.1%) complete BUSCOs, of which 6,412 (69.5%) were single copy and 144 (1.6%) were duplicated. The number of fragmented BUSCOs was 1,076 (11.7%), whilst 1,594 (17.2%) were missing. The BUSCO results indicate an acceptable level of genome completeness (<70%, Simão et al. 2015) for downstream analyses for the anoa genome assembly, and a slight improvement over the Egyptian river buffalo assembly (EGYBUF_1.0, Fig. 6).
Mammalian genomes contain large families of repeats (Goodier and Kazazian 2008), such as long interspersed nuclear elements (LINEs), short interspersed nuclear elements (SINEs), and long-terminal repeats (LTRs). RepeatMasker revealed that  42.12% of the lowland anoa genome is composed of repetitive regions (Table 7), which is comparable to data previously published for genome assemblies of river buffalo and other bovids (Deng et al. 2016;Low et al. 2019;Mintoo et al. 2019;El-Khishin et al. 2020). Results also agree with the repetitive content in the cattle genome (Fig. 7b). Both lowland anoa and cattle genomes showed 2 waves of repeat expansion in their repeat landscape (Fig. 7, a and b), suggesting a shared inheritance of such repeats. In the lowland anoa, the LINEs were more abundant, representing 30.04% of the repeats, followed by LTRs representing 3.10% and SINEs representing 1.03% (Table 7).

Conclusion
To date, whole-genome sequencing has allowed the identification of variants involved in domestication and genetic improvement  . 7. Interspersed repeat landscape of (a) the lowland anoa genome assembled in this study and (b) Bos taurus.
for several livestock species (Zimin et al. 2009;Canavez et al. 2012;Li et al. 2020;Rosen et al. 2020). However, the lack of wild buffalo genomes hinders further analyses addressing functional and evolutionary aspects of this group, as well as possible conservation efforts. The draft genome assembly of the lowland anoa reported here is expected to contribute to this gap in data availability, as this is the first draft genome assembly for wild Asian buffaloes. Furthermore, we showed that short-read Illumina sequencing data can still provide a cost-effective way of sequencing mammalian genomes to an adequate level of completeness for downstream comparative analyses.

Data availability
The raw data and assembly are available on NCBI under BioProject PRJNA849775. The genome assembly of the lowland anoa is available on NCBI under BioSample accession SAMN29133250. The raw data are available on the Sequence Read Archive (SRA) on NCBI under accession SRR21016826.