A high-quality genome assembly for the endangered golden snub-nosed monkey (Rhinopithecus roxellana)

Abstract Background The golden snub-nosed monkey (Rhinopithecus roxellana) is an endangered colobine species endemic to China, which has several distinct traits including a unique social structure. Although a genome assembly for R. roxellana is available, it is incomplete and fragmented because it was constructed using short-read sequencing technology. Thus, important information such as genome structural variation and repeat sequences may be absent. Findings To obtain a high-quality chromosomal assembly for R. roxellana qinlingensis, we used 5 methods: Pacific Bioscience single-molecule real-time sequencing, Illumina paired-end sequencing, BioNano optical maps, 10X Genomics link-reads, and high-throughput chromosome conformation capture. The assembled genome was ∼3.04 Gb, with a contig N50 of 5.72 Mb and a scaffold N50 of 144.56 Mb. This represented a 100-fold improvement over the previously published genome. In the new genome, 22,497 protein-coding genes were predicted, of which 22,053 were functionally annotated. Gene family analysis showed that 993 and 2,745 gene families were expanded and contracted, respectively. The reconstructed phylogeny recovered a close relationship between R. rollexana and Macaca mulatta, and these 2 species diverged ∼13.4 million years ago. Conclusion We constructed a high-quality genome assembly of the Qinling golden snub-nosed monkey; it had superior continuity and accuracy, which might be useful for future genetic studies in this species and as a new standard reference genome for colobine primates. In addition, the updated genome assembly might improve our understanding of this species and could assist conservation efforts.


Background
The snub-nosed monkeys (genus Rhinopithecus) consist of 5 endangered species narrowly restricted to China and Vietnam [1]. Of those, the golden snub-nosed monkey (Rhinopithecus roxellana, NCBI:txid61622), also known as the Sichuan snub-nosed monkey, has the northernmost distribution of all Asian colobine species; this monkey is found only in 3 isolated regions in central and northwest China (the Sichuan, Gansu, Shaanxi, and Hubei Provinces) [2,3]. The golden snub-nosed monkey is characterized by several distinctive traits, including golden fur, a blue facial color, an odd-shaped nose, and folivory ( Fig. 1). In addition, the species has a unique multilevel social system; such complex systems are found only in a few mammals, including humans [4]. Therefore, the Qinling golden snub-nosed mon-key is an ideal model for the analysis of social structure evolution in primates and may also provide opportunities to investigate evolutionary and socio-anthropological patterns of human society.
Based on morphological variations and discontinuous distributions, R. roxellana is distinguished into 3 subspecies: R. roxellana roxellana from the Minshan Mountain in the Sichuan and Gansu Provinces, R. roxellana qinlingensis from the Qinling Mountain in Shaanxi Province, and R. roxellana hubeiensis from Shennongjia Mountain in Hubei Province [3]. Recent studies of R. roxellana have focused on behavioral dynamics, population history, and social systems [5][6][7]. To date, only a single genome assembly is available for the golden snub-nosed monkey. This assembly, published in 2014, was derived from short sequencing reads generated by the Illumina HiSeq 2000 platform [8]. Several studies have been published based on these data, including analyses of the folivorous dietary adaptations of R. roxellana and its evolutionary history [8][9][10]. Despite the utility of these previously published data, much relevant information, including structural variations and repeat sequences, is largely absent or unreliable owing to the incomplete and fragmented genome assembly [11,12].
Owing to advances in sequencing technology, it is now possible to obtain high-quality genome assemblies that can provide new insights in organismal research. Indeed, many previously unreported transposable elements and specific genes in maize were identified using an improved reference genome [13]. By combining new sequencing approaches, Seo et al. [11] discovered clinically relevant structural variants and previously unreported genes in the updated human genome. New sequencing technologies, including Pacific Bioscience's (PacBio's) singlemolecule real-time (SMRT) sequencing, BioNano optical mapping, and high-throughput chromosome conformation capture (Hi-C)-based chromatin interaction maps, have been used in several species closely related to humans, including gorillas (Gorilla gorilla gorilla) [14], chimpanzees (Pan troglodytes) [15], and Sumatran orangutans (Pongo abelii) [15], as well as in other species, including the domestic goat (Capra hircus) [16]. Importantly, it was estimated that 87% of the missing reference exons and incomplete gene models were recovered using the new gorilla assembly [14]. In addition, several novel genes expressed in the brain were identified using the new orangutan assembly, and complete immune genes with longer repetitive structures were identified in the updated goat genome [16]. However, the R. roxellana genome has not yet been updated using new sequencing approaches, slowing progress towards a better understanding of this endangered species.
Here, we report a greatly improved assembly of the reference genome for R. roxellana generated by a combination of 5 technologies: SMRT sequencing from PacBio, HiSeq paired-end sequencing from Illumina (HiSeq), BioNano optical maps (Bio-Nano), 10X Genomics link-reads (10X Genomics), and Hi-C. Our results represent the first colobine genome sequenced and assembled with both long reads and short reads. This updated genome assembly may allow us to further investigate R. roxellana, providing new opportunities to analyze evolutionary history and to identify genetic changes associated with the development of specific traits in this species. Such analyses may provide insights helpful for the conservation of this endangered primate. In addition, this genome, which has superior continuity and accuracy, will act as a new reference genome for colobine primates.

Sample collection and sequencing
The animal used for sequencing was an adult male R. roxellana qinlingensis from Qinling Mountain, who died of natural causes and then was stored shortly after death in an ultra-cold storage freezer at Louguantai Breeding Centre, Xi'an, Shaanxi Province, China. Total genomic DNA was extracted from the heart tissue. To acquire a high-quality genome assembly, we combined 5 sequencing methods. Initially, PacBio SMRT sequencing was performed on the SEQUEL platform following the manufacturer's instructions. After quality control, during which subreads shorter than 500 bp were removed, 304.84 Gb clean long reads (95.86× coverage) remained. The N50 length of the PacBio reads was 16.69 kb. Simultaneously, paired-end sequencing was performed using an Illumina NovaSeq 6000 platform, with an insert size of 350 bp. Then those short reads were filtered using the SOAP-denovo2 software [17], removing reads with adapters, contaminations, >10% unknown bases (N), or low quality. After filtering, 422.00 Gb clean reads remained (133.12× coverage). A highquality optical genome map was also constructed with the Irys platform (BioNano Genomics). The N50 length of the molecules used for optical mapping was 338 kb. The average BioNano optimal marker density examined was 11.66 per 100 kb, while the average marker density was 12.62 per 100 kb for the predicted map based on the assembled contigs. Thus, the observed Bio-Nano map was consistent with the predicted map. The Bio-Nano map generated 463.75 Gb of large DNA molecules. Next, 10X genomic linked-reads sequencing was performed on an Illumina Hiseq Xten platform, generating 340.90 Gb clean reads (109.56× coverage). Finally, a Hi-C library was prepared and sequenced with an Illumina NovaSeq 6000 platform to produce a chromosome-scale scaffolding of the genome assembly. Adapter sequences and low-quality reads were discarded using Cutadapt v1.0 [18] with the parameters "-e 0.1 -O 5 -m 100 -n 2 -pair-filter = both", yielding 307.90 Gb clean data (97.77× coverage). Detailed sequencing statistics are given in Table 1.

De novo assembly of the R. roxellana genome
An estimation of genome size would increase our understanding of R. roxellana and the challenges in sequencing it. Thus, we estimated the size of the R. roxellana genome as G = (K total − K error )/D, where G represents genome size, K total represents the total number of k-mers, K error represents the number of k-mers with sequencing errors, and D indicates the k-mer depth. We generated 109,210,004,556 k-mers, 1,159,024,556 of which had sequencing errors. The peak k-mer depth was 34. Thus, the genome size of R. roxellana was estimated to be ∼3.18 Gb. The distribution of k-mer frequencies is given in Supplementary Fig. S1.
The de novo assembly of the newly sequenced R. roxellana genome was performed in 4 progressive stages. First, long reads obtained from the PacBio platform were assembled as follows: detection of overlap and read correction, detection of overlap between pairs of corrected reads, and string graph construction. Assembly of the PacBio long reads was performed using FALCON (version 0.4.0, Falcon, RRID:SCR 016089) [19] with the parameter set "length cutoff = 5000, length cutoff pr = 5000, pa HPCdaligner option = -v -B128 -e.70 -k14 -h128 -l2000 -w8 -T8 -s700, ovlp HPCdaligner option = -v -B128 -e.96 -k16 -h480 -l1500 -w8 -T16 -s700". Next, the assembled PacBio contigs were polished using Quiver (SMRTLink version 5.1.0) with PacBio long reads [20], and also the contig assembly was corrected by Pilon-  Note: The "Number" column represents the number of contigs/scaffolds longer than the value of the corresponding category. n/a: not applicable.
1.18 (java -Xmx500G -jar pilon-1.18.jar -diploid -threads 30) with Illumina short reads [21]. The contig N50 of the initial assembly was 4.74 Mb (Supplementary Table S1). Using the initial genome assembly, SSPACE-LongRead v1-1 [22] was implemented for getting a longer scaffold by processing PacBio long reads and the initial genome assembly with the command "perl SSPACE-LongRead.pl -c <contig-sequences> -p <pacbio-reads>". This procedure generated a genome assembly with scaffold N50 of 7.81 Mb (Supplementary Table S2). The remaining gaps in the assembly were closed using the PBjelly module in the PBSuite (version 15.8.24) [23] with default settings. Thus, at the end of the first stage, the genome assembly had a contig N50 of 5.72 Mb and a scaffold N50 of 8.20 Mb (Supplementary Table S3).
In the second stage, the BioNano molecules were filtered, requiring a minimum length of 150 kb and minimum of 9 labels per molecule. Then, a genome map was assembled de novo with IrysView (version 2.3; BioNano Genomics), based on the optically mapped molecules. The assembled PacBio scaffolds were input into hybridScaffold [24]. In brief, the hybrid scaffolding process included the alignment of the PacBio scaffolds against the Bio-Nano genome maps, followed by the identification and resolution of conflicting alignments. At the end of stage 2, the hybrid genome assembly had a scaffold N50 of 9.22 Mb (Supplementary  Table S4).
In the fourth stage, the Hi-C data were used to build chromosome-level assembly scaffolds. In brief, Hi-C sequencing data were first aligned to the assembled genome using BWA [28]. Scaffolds were then clustered, ordered, and oriented using Lachesis [29], with the parameter set "CLUS-TER MIN RE SITES = 1800, CLUSTER MAX LINK DENSITY = 4, and CLUSTER NONINFORMATIVE RATIO = 0" This procedure generated 22 accurately clustered and ordered pseudochromosomes, with a genome size of 3.04 Gb, a contig N50 of 5.72 Mb, and a scaffold N50 of 144.56 Mb ( Table 2). The pseudochromosomes were divided into 100-kb bins and the interaction frequencies between pairs of 100-kb genomic regions were determined (Fig. 2).

Assessment of the genome newly assembled
We evaluated our newly assembled R. roxellana genome against the previously published assembly. The contiguity of our R. roxellana genome was 100-fold greater (contig N50: 5.72 Mb; scaffold N50: 144.56) than the previous version (contig N50: 25.5 kb; scaffold N50: 1.55 Mb) [8]. We also aligned our genome against the previous version using MUMMER (v4.0.0beta2) [30] and identified 6,452 gaps in the previous version that were predicted to be filled by >29.7 Mb of sequence in our new assembly. These filled gaps were mainly located in the intergenic and repetitive regions, with a small fraction of the sequence data annotated as gene regions. Our new assembly also had a higher proportion of repeat sequences (50.82%) than the previous version (46.15%); in particular, the number of LINE (long interspersed elements) transposable elements and tandem repeats was greatly increased (further details are given below, in the "Identification of repeat elements" section). Thus, the newly assembled genome was substantially more complete and continuous. It was likely that the remarkable improvement in contiguity was due to the increased read length, deeper sequencing depth, improved gap assembly, and more sophisticated assembly algorithm.
To assess the accuracy of our genome assembly, we aligned the Illumina short reads to the assembly using BWA [28], with the parameters "-o 1 -i 15". Approximately 99.17% of the short reads were mapped to the genome assembly. Further investigations indicated that these reads covered ∼99.27% of the total assembly (Supplementary Table S6). Genome assembly accuracy was also measured using the standard variant calling method in samtools [31], with the command "samtools mpileup -q 20 -Q 20 -C 50 -uDEf". We found a total of 7,690 homozygous non- reference single-nucleotide polymorphisms, which reflected a low homozygous rate (0.0004%), suggesting that our genome assembly was highly accurate (Supplementary Table S7). In addition, we estimated assembly completeness using BUSCO (RRID: SCR 015008) v3.0.2 [32], with the parameters "-i -o -l -m genome -f -t" based on mammalia odb9 (creation date: 13 February 2016; number of species: 50; number of BUSCOs: 4,104). BUSCO analysis identified 4,104 mammalian BUSCOs in the newly assembled R. roxellana genome: 94.0% complete BUSCOs, 2.9% fragmented BUSCOs, and 3.1% missing BUSCOs (Supplementary Table S8). Assembly completeness was also measured using the core eukaryotic gene (CEG)-mapping approach (CEGMA v2.5) [33]. Of the 248 CEGs known from 6 model species, 93.95% (233 of 248) were identified in our new genome assembly. Of these, 220 CEGs were complete and unfragmented, and the remaining 13 were complete but fragmented (Supplementary Table S9). Together, these analyses indicated that our new genome assembly was highly accurate and complete.

Identification of repeat elements
Repeated sequences account for a large proportion of the total genome. It is thus important to identify repeat elements. Here, we predicted and classified repeat elements both based on homology and de novo. In the homology approach, we searched the genome for repetitive DNA elements (as listed in the Repbase database v16.02) using RepeatMasker v4.0.6 (RepeatMasker, RRID:SCR 012954 [34]) [35] with the parameters "-a -nolow -no is -norna -parallel 1" and using RepeatProteinMask (implemented in RepeatMasker). To identify repetitive elements de novo, we used RepeatModeler v1.0.11 (RepeatModeler, RRID:SCR 015027) [36], with the parameters "-database genome -engine ncbi -pa 15". Tandem repeats in the genome were detected using Tandem Repeat Finder (TRF) v4.07b [37], with parameters "2 7 7 80 10 50 2000 -d -h". We merged the results of the 2 methods. In total, the new genome assembly comprised 50.81% repetitive sequences (Supplementary Table S10). Closer investigation indicated that the largest categories of repeat elements in the R. roxellana genome were the short and long interspersed nuclear elements (SINEs and LINEs, respectively). In addition, several repeat elements absent from the Repbase database were detected in the de novo approach (Supplementary Table S10). The total length of these repeat elements was 186,195,432 bp, accounting for 6.13% of the genome, suggesting that these repeat elements may be specific for R. roxellana. Compared with the repeat sequences in the previous assembly, our genome included relatively more LINE transposable elements (28.23% vs 6.21%) and tandem repeats (6.20% vs 2.82%). The detailed categories of repeat elements are summarized in Supplementary Table S11.

Duplicate sequence identification
We also performed duplicate sequence identification analysis, which was fulfilled based on the read depth of Illumina short reads. In brief, we first mapped the Illumina short reads to the assembled genome using BWA with default parameters. Then, the sorted mapping bam file was used as input for CNVnator v0.3.3 [38], a tool targeting alterations in the read depth, with the parameters of "-unique -his 100 -stat 100 -call 100". The obtained duplicate sequences were filtered, retaining only those  Note: Pasa-update * includes only the untranslated regions; other regions were not included. Final set * represents the results after the Pasa filtering process, where the longest isoform was chosen in the case of multiple splicing isoforms; redundant single exons were also discarded. The "Number" column gives the number of protein-coding genes predicted by each method. where q0 was <0.5 and e-val1 was <0.05. After filtering, 676 duplicate sequences remained, with a total length of 9,198,900 bp (Supplementary Table S12). Further analysis showed 101 duplications located at the end of scaffolds (5% of the total length in both ends). And there were 136 genes present in the duplicated regions-these genes mainly involved in basic biological processes such as ribonucleoside binding, phosphatase activity, and protein dephosphorylation.
The genes predicted by each of the 3 approaches were merged using EVidenceModeler (EVidenceModeler, RRID:SCR 014659) [54] with the parameters "-segmentSize 200 000 -overlapSize 20 000". We weighted transcript predictions most highly, followed by homology-based predictions and ab initio predictions. Untranslated regions and alternative splicing of the predicted gene were explored using PASA, in conjunction with the transcriptome data [51]. In total, 22,497 genes were predicted in the R. roxellana genome (Table 3), each containing a mean of 7.71 exons. The detailed results of the gene prediction process are given in Table 3 and Fig. 3.
We also compared the gene structure, including mRNA length, exon length, intron length, and exon number, among R. roxellana and other representative primates (e.g., H. sapiens, G. gorilla, M. mulatta, R. bieti, and R. roxellana hubeiensis). We found that genome assembly patterns were similar among R. roxellana and the other primates ( Supplementary Fig. S2).
In addition, we estimated the genome assembly completeness using transcriptome data. The transcripts were derived from the de novo assembly with trinityrnaseq-2.1.1 mentioned above. Those transcripts were clustered into unigenes with the help of using TGICL (TIGR gene indices clustering program, v2.1) [60] with 95% identity similarity cut-off. The generated unigenes were aligned to our assembly version and previous version us-ing BLAT version 36 (BLAT, RRID:SCR 011919). Results showed that the completeness degree (percentage of unigenes aligned to a single scaffold in the genome) was higher in our assembly (95.35%) than in the previous assembly (89.28%) for unigenes >1,000 bp (Supplementary Table S15), demonstrating the contiguity of our new assembly.

Phylogenetic analysis and gene family estimation
The coding regions and protein sequences of 11 representative mammals were downloaded from Ensembl (Ensembl Release 75). For genes with multiple transcript isoforms, the longest was chosen. Treefam [61] was used to estimate gene families. Using an all-to-all blast, we identified 17,560 gene families. We reconstructed the phylogenetic relationships among R. roxellana and other mammals based on 4-fold degenerate sites extracted from the 5,418 single-copy gene families. Phyml v3.2 (PhyML, RRID: SCR 014629) [62] was used to construct a maximum-likelihood tree using the general time reversible + γ model, as inferred by JMODELTEST v2.1.10 (jModelTest, RRID:SCR 015244) [63]. We estimated divergence times with MCMCTREE in PAML v4.8 (PAML, RRID:SCR 014932) [64], using the Bayesian method and the fossil calibration times from TimeTree [65,66] (Fig. 4).
To investigate the evolutionary history of R. roxellana, we estimated the expansion and contraction of gene families in this species with CAFE 3.0 (CAFÉ, RRID:SCR 005983) [67]. A random birth and death model was used to study gene family variations along each lineage in the phylogenetic tree. This analysis identified 993 expanded gene families and 2,745 contracted gene families in the R. roxellana genome (Fig. 4). To determine the significance of each gene family, P-values in each lineage were estimated by comparing conditional likelihoods derived from a probabilistic graphical model. All gene families with Pvalues < 0.05 were further analyzed. To explore the significantly expanded gene families, we performed a GO-term enrichment analysis with EnrichPipeline32 [68,69], using the 1,370 genes belonging to the 314 significantly expanded gene families as input, and using all predicted genes as background. We considered a GO term significant if after adjustment the P-value was <0.05. We found that the significantly expanded gene families were mainly associated with the hemoglobin complex, energy metabolism, and oxygen transport (Supplementary Table S16).

Conclusion
In this study, we generated a high-quality genome assembly for the golden snub-nosed monkey (R. roxellana) using a combination of 5 advanced genomics technologies. Our results will inform studies of the origin and evolutionary history of the snubnosed monkey. In addition, this genome may provide a framework within which to survey the mechanisms underlying the formation of the distinct morphological and sociological characters of R. roxellana. This genome may also stimulate new insights into the improvement of strategies to conserve and manage this endangered species. Finally, this genome, which has superior continuity and accuracy, may serve as a new standard reference genome for colobine primates.