Genome Assembly of Salicaceae Populus deltoides (Eastern Cottonwood) I-69 Based on Nanopore Sequencing and Hi-C Technologies

Abstract Populus deltoides has important ecological and economic values, widely used in poplar breeding programs due to its superior characteristics such as rapid growth and resistance to disease. Although the genome sequence of P. deltoides WV94 is available, the assembly is fragmented. Here, we reported an improved chromosome-level assembly of the P. deltoides cultivar I-69 by combining Nanopore sequencing and chromosome conformation capture (Hi-C) technologies. The assembly was 429.3 Mb in size and contained 657 contigs with a contig N50 length of 2.62 Mb. Hi-C scaffolding of the contigs generated 19 chromosome-level sequences, which covered 97.4% (418 Mb) of the total assembly size. Moreover, repetitive sequences annotation showed that 39.28% of the P. deltoides genome was composed of interspersed elements, including retroelements (23.66%), DNA transposons (6.83%), and unclassified elements (8.79%). We also identified a total of 44 362 protein-coding genes in the current P. deltoides assembly. Compared with the previous genome assembly of P. deltoides WV94, the current assembly had some significantly improved qualities: the contig N50 increased 3.5-fold and the proportion of gaps decreased from 3.2% to 0.08%. This high-quality, well-annotated genome assembly provides a reliable genomic resource for identifying genome variants among individuals, mining candidate genes that control growth and wood quality traits, and facilitating further application of genomics-assisted breeding in populations related to P. deltoides.

for improving the wood quality and yield to meet the growing demand in industry, a variety of genetic and genomic resources need to be developed, especially the high-quality genome sequence (Fang et al. 2018). Furthermore, an accurate genome sequence would be beneficial to performing comparative genomics and studying the evolution of the species.
In forest trees, poplar genomes are relatively small, but they are highly heterozygous, and possess abundant repetitive sequences (Michael and VanBuren 2020). In recent years, the genomes of several poplar species have been sequenced, assembled, and annotated, including Populus trichocarpa (Tuskan et al. 2006), Populus pruinosa (Yang et al. 2017), Populus alba , Populus euphratica , and Populus simonii . Although the genome sequence of P. deltoides WV94 is available in the Phytozome database (http://phytozome.jgi.doe.gov), it contains many gaps (3.2%) and has a lot of sequences that were unanchored into chromosomes, thereby impeding downstream analysis. Therefore, there is an active demand to improve the current assembly of P. deltoides genome. Recent advances in Nanopore sequencing and chromosome conformation capture (Hi-C) technologies play an important role in improving plant genomes at the chromosome level (Giani et al. 2020). Nanopore sequencing exhibits many advantageous qualities, including high sequencing throughput, low cost, and extremely long-read lengths that can span repetitive regions (Cali et al. 2019). Moreover, Hi-C technology has emerged as a robust tool to reconstruct genomes at the chromosome level, providing long-range information about the grouping and linear organization of sequences along entire chromosomes (Burton et al. 2013). Recently, several high-quality plant genomes had been obtained by using Nanopore sequencing and Hi-C technologies, including Eriobotrya japonica (Jiang et al. 2020), Aquilaria sinensis (Ding et al. 2020), and Asparagus setaceus .
In this study, we performed the de novo genome assembly of the P. deltoides cultivar I-69 using Nanopore sequencing and Hi-C technologies. The P. deltoides I-69 was originally collected from a natural population in Illinois, United States, and selected as a cultivar in Italy in the 1950s. Later on, it was introduced to China in 1972 and greatly improved industrial timber production, agroforestry and ecoremediation since then (Zhang et al. 2009). As a female parent, the P. deltoides I-69 was crossed with P. simonii L-3 to generate an F 1 hybrid population for constructing genetic linkage map and locating quantitative trait loci (QTL) in Populus (Mousavi et al. 2016;Tong et al. 2016). Although we successfully obtained the genome sequence of the male parent of P. simonii in a previous study , the lack of high-quality genomic information of the female parent P. deltoides I-69 was a barrier to develop new cultivars in the F 1 hybrid population. Therefore, we conducted the whole genome sequencing and performed de novo genome assembly of the female parent, resulting in an improved genome sequence of P. deltoides. The high-quality assembly of the P. deltoides genome provided a valuable resource for poplar breeding and genetic improvement, as well as comparative genomics analysis with related species.

Biological Materials
A single cutting of the P. deltoides cultivar I-69 was collected from the Siyang Forest Farm in Siyang country, Jiangsu Province, China and planted at the Xiashu Forest Farm of Nanjing Forestry University, Jurong, Jingsu Province, China in 2012 (Mousavi et al. 2016). Fresh leaves were collected from the clonal tree, immediately transferred to liquid nitrogen, and stored until DNA extraction.

DNA Sequencing and Genome Assembly
Genomic DNA was extracted using the Qiagen DNeasy Blood and Tissue kit and evaluated on a 0.75% agarose gel. The concentration and purity of the genomic DNA were determined by Nanodrop (OD260/280) and Qubit (Thermo Fisher Scientific). To generate Nanopore long reads, ~15 µg of genomic DNA was size-selected (>20 Kb) with a Blue Pippin System (Sage Science). According to 1D Genomic DNA by Ligation protocol (Oxford Nanopore Technologies, ONT), ONT SQK-LSK108 library preparation kit was used to construct PCR-free libraries. Nextomics (Nextomic Biosciences Technologies Corporation, Wuhan, China) and Biozeron (Shanghai Biozeron Biotechnology Corporation, Shanghai, China) performed the Nanopore sequencing of prepared libraries on 2 flow-cells using the GrildION X5 sequencer according to the manufacturer's instructions (Oxford Nanopore, Oxford, UK). The raw signal data were base-called using ONT MinKNOW software (v1.4.2), and the statistics of the generating Nanopore sequencing data were calculated using Nanostat (v0.8.1) (De Coster et al. 2018).
In addition, using the same individual tree described above, we sequenced the whole genome of P. deltoides by the Illumina HiSeq 2000 sequencing platform at BMK (Biomarker Technologies Corporation, Beijing, China). The detailed procedure used for whole-genome sequencing was described in Mousavi et al. (2016). The paired-end (PE) reads data are available from the NCBI SRA database under the accession number SRP071167. For the purpose of correcting the contigs assembled with the Nanopore sequencing data, these whole-genome sequencing data were processed to filter the low-quality reads using the NGS QC Toolkit (v2.3.3) (Patel and Jain 2012) with default parameters to obtain high-quality reads.
To assist gene prediction, RNA-sequencing (RNA-seq) data were generated from samples of leaf and stem tissue. Total RNA was isolated using TRIzol Reagent (Invitrogen). RNA libraries were prepared using the TruSeq RNA library Preparation Kit (Illumina), and then sequenced on the Illumina HiSeq 2000 platform with PE mode at Beijing Genomics Institute (BGI). The generating RNA-seq data were also filtered using the NGS QC Toolkit (v2.3.3) with default parameters (Patel and Jain 2012).
The Hi-C library was constructed and sequenced by Frasergen (Wuhan Frasergen Biotechnology Corporation, Wuhan, China) following the standard procedures. Briefly, nuclear chromatin was cross-linked by 1% formaldehyde at room temperature for 30 min. The cross-linked DNA was extracted and then digested with MboI restriction enzyme at 37 °C for 4 h, end-labeled with biotin, and ligated using T4 DNA ligase. The cross-links were reversed by overnight incubation at 65 °C, and the circular DNA was sheared into 300-500 bp fragments. After end repairing, the DNA fragments were purified and sequenced on Illumina HiSeq X Ten platform with PE mode.
The software packages used in the genome assembly and annotation were summarized in Table 1. Oxford Nanopore sequencing data were employed for genome assembly by using the software Canu (v1.9) with default parameters (Koren et al. 2017). To improve the base accuracy, the assembled genome sequence was polished in 3 steps. First, the Nanopore reads were mapped back to the assembled contigs with minimap2 (v2.12) (Li 2018), and then these contigs were polished 3 rounds with Racon (v1.4.3) with default parameters (Vaser et al. 2017). Second, one round of Medaka (v1.4.3) (https://github.com/nanoporetech/medaka) polishing was performed with parameter "-m r941_min_fast_g303" using the Nanopore reads. Third, high-quality Illumina short reads were mapped to the Medaka consensus sequence using BWA-MEM (v0.7.17) (Li 2013), and then subjected to 3 rounds of polishing with NextPolish (v1.1.0) (Hu et al. 2020). Additionally, considering the high heterozygosity of forest tree genome could hinder downstream analysis, we aligned the Nanopore reads to the polished contigs using Minimap2 (v2.12) (Li 2018). Based on this alignment, Purge Haplotigs (v1.1.1) (Roach et al. 2018) was used to identify and remove redundant heterozygous sequences, resulting in the draft genome of the P. deltoides I-69.
To obtain the chromosome-level genome, we aligned the Hi-C reads to the draft genome and computed the Hi-C contact frequency between genomic loci using the Juicer pipeline (v1.7.6) . Afterwards, the contigs were corrected for mis-joins, ordered, oriented, and anchored into a candidate chromosome-length assembly using the 3D-DNA (v180114) with default parameters (Dudchenko et al. 2017). The candidate assembly was visualized with the Juicebox Assembly Tools (JBAT v1.11.8) . Then the misassembled contigs or the mis-joined scaffolds were manually detected and corrected based on the appearance of a bright band of elevated contact frequency along the diagonal of the Hi-C heatmap (Dudchenko et al. 2018). After JBAT review, we used the "run-asm-pipeline-post-review.sh" in 3D-DNA to generate the adjusted assembly, and then the scaffolds shorter than 2 Kb were removed.

Genome Evaluation
To assess the quality of the genome assembly, we mapped the high-quality Illumina short reads back to the final assembled genome using BWA-MEM (v0.7.17) (Li 2013). Then the duplicated reads were marked using the "markdup" function of Sambamba (v0.7.1) (Tarasov et al. 2015), and variant calling was performed to evaluate the accuracy of the genome at the single-base level using GATK HaplotypeCaller (v4.1.8) (DePristo et al. 2011). We further performed Benchmarking Universal Single-Copy Orthologs (BUSCO v4.0.1) analysis to evaluate the completeness of the P. deltoides I-69 assembly with embryophyta_odb10 database (Simao et al. 2015).

Repetitive Sequences Annotation
Plant genomes usually consist of large numbers of repeat elements that have been demonstrated to have structural and functional roles (Biscotti et al. 2015). Repetitive sequences were annotated using RepeatMasker (v4.1.0) (Tarailo-Graovac and Chen 2009) based on a combined library generated by de novo-based and homology-based approaches. We used the software RepeatModeler (v2.0.1) (Tarailo-Graovac and Chen 2009) to construct a de novo repeat library with the assembled genome sequence as input. Then combined with the known repeat library from Repbase (Bao et al. 2015), RepeatMasker was used to perform a homology-based repeat search. To ensure the integrity of genes in the subsequent analyses, low complexity or simple repeats were not masked because some of these sequences could be within genes (Xing et al. 2019).

Genome Assembly
A total of 44.4 Gb long-read data was obtained from the Oxford Nanopore sequencing platform, covering ~100× of the P. deltoides genome, with an average sequence length of 16 Kb and N50 of 25 Kb (Supplementary Figure 1). In addition to the long-read data, the data of Illumina short reads, Hi-C reads, and RNA-seq reads were also obtained and summarized in Table 2. De novo assembly was performed using the Canu software with the Nanopore sequencing data, and further polished for base accuracy in 3 steps (see Methods). As a result, the primary assembly of the P. deltoides consisted of 2186 contigs, amounting to 545 Mb with a contig N50 length of 1.83 Mb, and the length of longest contigs reached 16.8 Mb. After removing the potential heterozygous sequences in the primary genome ( Supplementary  Figure 2), the assembly size reduced to 429 Mb, with the number of contigs reduced to 657 and length of contig N50 increased to 2.62 Mb (Table 3). To construct the chromosome-level genome, a total of 51.04 Gb Hi-C reads were generated. Before scaffolding the assembled contigs, we assessed the quality of Hi-C data, of which 99.18% were mapped to the draft genome and 46.9% were uniquely mapped. Based on Hi-C linking information ( Figure  1), 421 contigs were identified as containing assembly errors and broken into 1716 contigs by using the 3D-DNA software. Then, 3D-DNA assigned 776 contigs with a total size of 418 Mb (97.4%) into 19 groups, perfectly matching the karyotype of Populus. Each group represented a chromosome-level assembly of which the order number was determined by aligning its sequence to the genome sequences of the P. deltoides W94 using Minimap2 (v2.12) (Li 2018). The length of the largest chromosome was 53.17 Mb, while the smallest one was 13.45 Mb. The scaffold N50 of the chromosome-level genome assembly reached 21.51 Mb (Supplementary Table 1).

Genome Evaluation
Based on the alignment of the high-quality Illumina short reads (Table  2) to the genome, the high mapped rate (98.82%) and coverage rate (96.9%) indicated a high consistency between the reads and the genome assembly. Through variant calling, we identified a total of 2 744 622 SNPs (0.64% of the genome), of which 63 341 SNPs (0.0147% of the genome) belonged to homozygous SNPs, indicating a high accuracy of genome assembly at the single-base level (Liang et al. 2020). The completeness of the genome assembly was evaluated using BUSCO with embryophyta_odb10 database. Of the 1614 plant-specific orthologs, 1585 (98.2%) were identified to be complete BUSCO genes, with 0.3% being partial BUSCOs identified and only 1.5% missed. Overall, all the results suggested that the quality of the assembly was high with respect to the base level accuracy and the completeness of the assembly. In addition, the alignment of our P. deltoides I-69 genome to the P. deltoides W94 genome with the MUMmer software (Kurtz et al. 2004) showed that there was a high degree of synteny between the 2 genome sequences (Krzywinski et al. 2009) (Supplementary Figure 3). Moreover, the differences in chromosome sizes among the genomes available in Populus were summarized in Supplementary Table 2.

Gene Prediction
The Trinity assembly of the RNA-seq data contained 117 189 transcripts with a contig N50 of 2134 bp. Combining the ab initio, homology-based, and RNA-seq-assisted predictions, a total of 32 245 genes encoding 44 362 proteins were predicted in the P. deltoides genome. The average CDS length was 1258bp, with the average number of exons was 5.6 for a single CDS (Supplementary Table 4). With the obvious improvement in the contiguity of the P. deltoides I-69 genome, BUSCO analysis showed that the P. deltoides I-69 had more complete BUSCO genes identified (95.2%) than P. deltoides WV94 (92.7%) in the protein-coding genes ( Figure 2). Furthermore, the software OrthoVenn2 was used to identify orthologous genes among the genomes of P. deltoides, P. euphratica, P. trichocarpa, P. simonii, and S. suchowensis (Lee et al. 2020). These species shared a total of 14 446 orthologous groups, and 986 unique orthologous groups for P. deltoides were identified (Supplementary Figure 4).

Discussion
The genome sequence of the P. deltoides I-69 was assembled by integrating Nanopore sequencing and Hi-C technologies, exhibiting some improvements over the genome assembly of P. deltoides WV94 which was assembled using PacBio sequencing technology and a genetic linkage map (http://phytozome.jgi.doe.gov). First, the contig N50 length of our assembly reached 2.62 Mb, a 3.5-fold improvement over that of P. deltoides WV94 (0.59 Mb), possibly due to the Nanopore reads generally being longer than the SMRT-seq reads (Rice and Green 2019). Second, although P. deltoides WV94 genome had its contigs assigned into chromosomes based on a  genetic linkage map, there remained 1356 scaffolds with total size of 43.6 Mb unanchored, accounting for 9.8% of the assembly. By applying Hi-C technology, we accurately constructed 19 chromosomes with total size of 418 Mb accounting for 97.4% of the assembly size, only leaving about 12 Mb sequences unanchored into chromosomes. Third, the genome completeness analysis showed that more completely conserved genes (98.2%) were found in the P. deltoides I-69 than P. deltoides WV94 (97.5%) (Figure 2). These aspects demonstrated that our newly assembled genome achieved a higher level of continuity and quality, suggesting that it can be a good alternative resource for Populus. High heterozygosity is a common feature in forest tree genomes, and the software Canu assembles heterozygous regions into separate contigs when a pair of allelic sequences exceeds a certain threshold of nucleotide diversity, rather than the expected single haplotype-fused contig (Roach et al. 2018). Thus, the overall assembly size resulted from Canu will be larger than the haploid genome size, making it difficult to scaffold contigs and predict genes downstream (Guan et al. 2020). To solve such a problem, we applied the Purge Haplotigs software to remove the heterozygous sequences of the P. deltoides I-69 genome in this study (Roach et al. 2018). Before the analysis with Purge Haplotigs, the assembly size of the P. deltoides I-69 genome was 545 Mb, substantially exceeding the size of the existed P. deltoides WV94 genome (447 Mb).
After the removing process, the assembly size of the P. deltoides I-69 reduced to 429 Mb, and the contig N50 increased from 1.8 to 2.62 Mb. Meanwhile, BUSCO analysis showed that the proportion of complete BUSCO genes was also increased from 96.6% to 98.2%. The most obvious effect was that the duplicated BUSCO proportion decreased from 30.9% to 15.7%, and the proportion of the complete and single-copy BUSCOs increased from 65.7% to 82.5%, suggesting that the heterozygosity of the P. deltoides I-69 genome reached a lower level (Ran et al. 2019).
With the advances in genomic assembly technologies, Hi-C technology was able to identify more complete sequences at the chromosome level, but it still contained some assembly errors during scaffolding (Giani et al. 2020). For instance, the folding of chromatin in the topologically associated domain could result in higher interaction frequencies of distant regions along chromosome, which will potentially cause misestimation of the distance between 2 adjacent regions along chromosome, leading to some artificial inversions and mis-joined scaffolds during assembly (Oddes et al. 2018;Xu and Dixon 2020). These errors can be better corrected by resorting a dense genetic map, so that several species had applied genetic linkage maps to refine assemblies at present (Maroso et al. 2018;Gaur et al. 2020;Shi et al. 2020). Therefore, efforts could be made to construct a high-quality, high-density linkage map of P. deltoides for further refining the P. deltoides genome assembly.
The current genome assembly of the P. deltoides I-69 provides an essential genetic resource for mining genes underlying economically important traits and studying the evolution of the species in Populus. First, the new assembly provides a specific parental reference sequence for extracting tens of thousands of SNP genotypes across an F 1 hybrid population in which the P. deltoides I-69 was as a female parent (Mousavi et al. 2016;Tong et al. 2016). With the large number of SNPs, high-density genetic linkage maps of the parents can be constructed and thus the QTL mapping or genomewide association studies could be conducted for growth and woody traits Chen et al. 2021). Second, this genome sequence allowed us to identify genes unique to the cultivar I-69 through orthologous analysis (Supplementary Figure 4). The unique genes may contain some detrimental genes that could cause this cultivar to have a poor rooting ability (Zhang et al. 2009). It is very important to purge these detrimental genes by selective breeding so as to develop new cultivars with improved rooting capacity for increasing the survival rate and expanding the scale of planting. Finally, the improved genome sequence would be beneficial to further elucidating the inconsistent results of phylogenetic relationships among P. deltoides, P. trichocarpa, and P. simonii. According to morphology, P. trichocarpa and P. simonii were assigned into the same section of Tacamahaca, while P. deltoides belonged to the different section of Aigeiros (Eckenwalder 1996), indicating that P. trichocarpa and P. simonii were more closely related than any other pair of the 3 species. However, recent molecular evolution analyses showed that the relationship of P. deltoides with P. trichocarpa was closer than with P. simonii (Zong et al. 2019;Wu et al. 2020).

Supplementary Material
Supplementary data are available at Journal of Heredity online.

Funding
This work was supported by the National Natural Science Foundation of China (31870654 and 31270706 awarded to C.T.) and the Priority Academic Program Development of Jiangsu Higher Education Institutions. The funding bodies were not involved in the design of the study, collection, analysis, and interpretation of data, and in writing the manuscript.