Chromosome-level de novo assembly of the pig-tailed macaque genome using linked-read sequencing and HiC proximity scaffolding

Abstract Background Macaque species share >93% genome homology with humans and develop many disease phenotypes similar to those of humans, making them valuable animal models for the study of human diseases (e.g., HIV and neurodegenerative diseases). However, the quality of genome assembly and annotation for several macaque species lags behind the human genome effort. Results To close this gap and enhance functional genomics approaches, we used a combination of de novo linked-read assembly and scaffolding using proximity ligation assay (HiC) to assemble the pig-tailed macaque (Macaca nemestrina) genome. This combinatorial method yielded large scaffolds at chromosome level with a scaffold N50 of 127.5 Mb; the 23 largest scaffolds covered 90% of the entire genome. This assembly revealed large-scale rearrangements between pig-tailed macaque chromosomes 7, 12, and 13 and human chromosomes 2, 14, and 15. We subsequently annotated the genome using transcriptome and proteomics data from personalized induced pluripotent stem cells derived from the same animal. Reconstruction of the evolutionary tree using whole-genome annotation and orthologous comparisons among 3 macaque species, human, and mouse genomes revealed extensive homology between human and pig-tailed macaques with regards to both pluripotent stem cell genes and innate immune gene pathways. Our results confirm that rhesus and cynomolgus macaques exhibit a closer evolutionary distance to each other than either species exhibits to humans or pig-tailed macaques. Conclusions These findings demonstrate that pig-tailed macaques can serve as an excellent animal model for the study of many human diseases particularly with regards to pluripotency and innate immune pathways.

The manuscript would benefit from a comparison showing how much the new genome assembly improves on the previous assembly: for example, a comparative repeat analysis, or a description of which regions were absent from the previous version, what they contain, and how many gaps were filled.
This combined method greatly increased the scaffold N50; however, the statistics at the contig level were not reported.
Page 14. The sentence "Here we present, to our knowledge, the first linked-reads and HiC-based assembly of a primate genome." in the Discussion section may not be correct. To my knowledge, these technologies have already been used in the genome assembly of Rhinopithecus roxellana, the golden snub-nosed monkey.
The version of genome assembly for other species used for genome comparison should be detailed.
The legend for figure 3 is not adequate. What is the reader supposed to conclude from the figure?
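For reference, the scaffold-level and contig-level N50 statistics raised in the comments above can be computed as in the following minimal Python sketch. It is illustrative only and is not the authors' pipeline: the function names and the `min_gap` threshold of 10 consecutive N characters used to split scaffolds into contigs are assumptions, and assembly tools differ in how they define a gap.

```python
import re


def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input


def contig_lengths(scaffold, min_gap=10):
    """Split one scaffold sequence at runs of >= min_gap N characters
    (assumed gap definition) and return the resulting contig lengths."""
    pieces = re.split(r"[Nn]{%d,}" % min_gap, scaffold)
    return [len(piece) for piece in pieces if piece]


if __name__ == "__main__":
    # Toy scaffolds; real input would be parsed from a FASTA file.
    scaffolds = ["ACGT" * 5 + "N" * 10 + "ACGT" * 3, "ACGT" * 2]
    print("scaffold N50:", n50(len(s) for s in scaffolds))
    contigs = [n for s in scaffolds for n in contig_lengths(s)]
    print("contig N50:", n50(contigs))
```

Reporting both values side by side shows how much of the scaffold N50 reflects joins across gaps rather than contiguous sequence.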
Reviewer #2: The paper used a HiC approach to capture the three-dimensional structure of the chromosomes inside the cell, leading to a much more contiguous reconstruction of the genome than was previously possible. The resulting pig-tailed macaque genome covers 90% of the combined chromosome length and is similar to the human genome in its scaffold length. Specifically, using a combination of linked reads and proximity ligation, the authors were able to increase the scaffold N50 to 127.5 Mb; the scaffolds were then assigned to chromosomes and annotated with resequenced mRNA from iPSCs. This paper represents a significant improvement over past assemblies of the macaque genome and is a much-needed contribution to the field of primate comparative genomics.
I only have several minor comments.
1. It would be good for the authors to explain how they have addressed the main problems with the HiC data: 1) nonrandom association between topological domains (described in Dixon et al., 2012) and 2) accurate orientation of inversions within the scaffolds.
2. The authors assume that macaque repeats are identical to those of humans and run RepeatMasker. It is at least theoretically possible that there are unknown types of repeats in this species. I would therefore recommend running RepeatModeler as well to account for this possibility.

Changes according to reviewers' comments:
1. The package version and parameter settings of each software tool were added to the Methods section where the tool is mentioned, along with the databases used. We used default parameter settings for most of the software; slight changes in parameter settings are detailed in the Methods section. The list and versions of all software used in this study are given below.
a. Supernova 1.2.0
b. BUSCO 3.0.2
c. Satsuma Synteny 3.1.0
d. Circos 0.69.6
e. SNAP 1.0beta.16
f. RepeatMasker 4.0.8
g. Maker 2.31.9
h. HiSat 2.1.0
i. POFF (proteinortho5.pl)
j. RAxML 8.2.10
k. MAFFT 7.310
l. Astral 5.6.2
m. Dfam 3.0
2. The gene names were changed to italic format.
3. We ran RepeatModeler on this chromosome-level assembly to identify new repeats that were not identified in the previous assembly or listed in the RepBase database. We found an additional 1.28% repeats specific to this species that were not identified before. The results of RepeatModeler are attached to this resubmission of the paper.
4. We also attach to this resubmission detailed statistics of the assemblies produced with linked reads alone (Linkedread) and with the combination of linked-read assembly and HiC (Linkedread + HiC).
5. We deleted the sentence "Here we present, to our knowledge, the first linked-reads and HiC-based assembly of a primate genome." on Page 14 from the discussion.
6. We added the version of the genome assembly of the other species used for the analysis.
7. We added more details to the legend of figure 3.
Reviewer #2: 1. We used Dovetail's HiRise pipeline for the scaffolding, and we asked them to help clarify how the HiC method is used for assembly. Their explanation is as follows:
"1. HiRise does not directly address issues of distinguishing proximity ligation associations due to topological domains from those due to the correlation between proximity and linear separation on a DNA molecule. When run with Hi-C data only, this issue is minimized by having the input assembly contiguity at a larger-than-TAD scale, say contig N50 of 500 kbp to 1 Mbp and up.
2. Hi-C data does not in general provide enough resolution to avoid inversion errors for smaller contigs, since in each possible orientation, the Hi-C data have similar numbers of links. For larger contigs, > 500 kbp, Hi-C links for the correct orientation dominate and scaffolds are correct. This is a property of Hi-C data independent of scaffolding tool and not due to TADs in particular."
We have added a short paragraph on this to the Materials and Methods section.
2. We ran RepeatModeler on this chromosome-level assembly to identify new repeats that were not identified in the previous assembly or listed in the RepBase database. We found an additional 1.28% repeats specific to this species that were not identified before. The results of RepeatModeler are attached to this resubmission of the paper.