A chromosome-level reference genome of the hazelnut, Corylus heterophylla Fisch

Abstract

Background: Corylus heterophylla Fisch. is a species of the Betulaceae family native to China. As an economically and ecologically important nut tree, C. heterophylla can survive in extremely low temperatures (–30 to –40 °C). To deepen our knowledge of the Betulaceae species and facilitate the use of C. heterophylla for breeding and its genetic improvement, we have sequenced the whole genome of C. heterophylla.

Findings: Based on >64.99 Gb (∼175.30×) of Nanopore long reads, we assembled a 370.75-Mb C. heterophylla genome with contig N50 and scaffold N50 sizes of 2.07 and 31.33 Mb, respectively, accounting for 99.23% of the estimated genome size (373.61 Mb). Furthermore, 361.90 Mb of contigs were anchored to 11 chromosomes using Hi-C link data, representing 97.61% of the assembled genome sequences. Transcriptomes representing 4 different tissues were sequenced to assist protein-coding gene prediction. A total of 27,591 protein-coding genes were identified, of which 92.02% (25,389) were functionally annotated. The phylogenetic analysis showed that C. heterophylla is close to Ostrya japonica, and they diverged from their common ancestor ∼52.79 million years ago.

Conclusions: We generated a high-quality chromosome-level genome of C. heterophylla. This genome resource will promote research on the molecular mechanisms by which the hazelnut responds to environmental stresses and serve as an important resource for genome-assisted improvement in cold and drought resistance of the Corylus genus.

conditions. Therefore, owing to its cold and drought resistance, C. heterophylla can be used as parent material for cross-breeding with other hazel species.

In the present study, to better understand the molecular mechanism of how hazelnuts respond to environmental stress, we assembled a high-quality genome of C. heterophylla using a combination of the Oxford Nanopore high-throughput sequencing technology and the [...] Table S1a). [...] x) were processed by Jellyfish to assess their k-mer distribution (k-mer value = 19).

Theoretically, the k-mer frequency follows a Poisson distribution. We selected k = 19 for the genome size estimation in this study. Genome sizes were calculated from the following [...]

[...] reads (fastq) were extracted from base-called FAST5 files using poretools [12]. Then, the short reads (<5 kb) and reads having low-quality bases and adapter sequences (YSFRI, 2019c) were [...] Tables S1b and S1c). [...] Table S1d).
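The formula used for the genome size calculation is not recoverable from the text above. As an illustration only, the sketch below applies a widely used k-mer estimate (total number of k-mers divided by the depth of the main peak of the k-mer histogram). The function names, the `min_depth` error cutoff, and the toy histogram are assumptions made for this example and are not taken from the study; the two-column "depth count" input mirrors the layout of a Jellyfish `histo` report.

```python
from typing import Dict


def parse_histo(text: str) -> Dict[int, int]:
    """Parse a two-column k-mer histogram: '<depth> <number of distinct k-mers>'."""
    histogram = {}
    for line in text.strip().splitlines():
        depth, count = line.split()
        histogram[int(depth)] = int(count)
    return histogram


def estimate_genome_size(histogram: Dict[int, int], min_depth: int = 5) -> float:
    """Estimate genome size as (total k-mers) / (depth of the main peak).

    min_depth is a hypothetical cutoff that discards low-depth k-mers,
    which are dominated by sequencing errors.
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # Depth at which the largest number of distinct k-mers is observed
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences = sum over depths of depth * distinct k-mers
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth


# Toy histogram (made-up numbers): an error peak at depth 1-2, main peak at depth 20
toy_text = """1 1000
2 200
18 50
19 80
20 100
21 80
22 50"""
toy = parse_histo(toy_text)
print(estimate_genome_size(toy))  # 7200 k-mers / peak depth 20 = 360.0
```

On real data the peak position is sensitive to heterozygosity and repeats, so dedicated tools fit a model to the histogram rather than taking the raw maximum.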

Hi-C experiments were performed as described with some modifications [13,14]. Briefly, 2 g of freshly harvested leaves were cut into 2- to 3-mm pieces and infiltrated in 2% formaldehyde before cross-linking was stopped by adding glycine. The tissue was ground to powder and suspended in nuclei isolation buffer to obtain a nuclei suspension. The procedure for the Hi-C experiment, including chromatin digestion, labeling of DNA ends, DNA ligation, purification, and fragmentation, was performed as described previously [15]. The cross-linked DNA was digested with HindIII as previously described and marked by incubating with Klenow enzyme and biotin-14-dCTP overnight at 37 °C [15]. The 5' overhang of the fragments was repaired and labeled using biotinylated nucleotides, followed by ligation with T4 DNA polymerase. After reversal of cross-linking, ligated DNA was purified and sheared to 300-700 bp fragments using an S2 Focused-Ultrasonicator (Covaris Inc., MA, USA). The linked DNA fragments were enriched with streptavidin beads and prepared for Illumina HiSeq X Ten sequencing, producing 231.31 Mb (totaling ~69.11 Gb) Hi-C links data (Supplementary Table S1e). [...] Table S1a). [...] Table S2).
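To make the digestion step concrete, the following sketch performs an in-silico HindIII digest of a sequence. It relies only on the known HindIII recognition site (A^AGCTT, cutting after the first A); the function name and the toy sequence are hypothetical, and the code is not part of the authors' pipeline.

```python
from typing import List

HINDIII_SITE = "AAGCTT"  # HindIII recognition sequence (palindromic)
CUT_OFFSET = 1           # HindIII cuts between the first A and AGCTT


def digest(seq: str, site: str = HINDIII_SITE, cut_offset: int = CUT_OFFSET) -> List[str]:
    """Return the fragments produced by cutting seq at every occurrence of site."""
    seq = seq.upper()
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        # Resume the search one base further so overlapping sites are not missed
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


toy_seq = "GGGAAGCTTCCCCAAGCTTTT"  # made-up sequence with two HindIII sites
fragments = digest(toy_seq)
print(fragments)  # ['GGGA', 'AGCTTCCCCA', 'AGCTTTT']
```

Because the site is palindromic, scanning one strand is enough to locate every cut position on both strands.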

The BUSCO database detected 93.47% and 1.18% of complete and partial gene models, respectively, in the C. heterophylla assembly results (Table 3). The core eukaryotic [...] Table S3b). Additionally, the heatmap of the Hi-C interaction frequency was used to visually assess the assembly accuracy of the C. heterophylla genome. The [...] The signal intensities clearly divide the 'bins' into eleven distinct groups (LG01-LG11), demonstrating the high quality of the chromosome assignment (Fig. 2). These observations suggest the high quality and completeness of this chromosome-level reference genome for C. [...] Table 4).

The top three classes of repetitive elements were ClassI/LARD, ClassI/LTR/Gypsy, and ClassI/LTR/Copia, occupying 20.51%, 11.14%, and 10.44% of assembled genome sequences, respectively (Table 4).
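As an illustration of how a Hi-C interaction heatmap of the kind mentioned above is built from 'bins', the sketch below counts read-pair contacts in a symmetric bin-by-bin matrix. The function name, the coordinate convention (0-based positions on one concatenated sequence), and the bin size are assumptions for this example; the bin size used in the study is not recoverable from the surviving text.

```python
from typing import List, Sequence, Tuple


def contact_matrix(pairs: Sequence[Tuple[int, int]],
                   genome_length: int,
                   bin_size: int) -> List[List[int]]:
    """Count Hi-C contacts between fixed-size bins.

    pairs holds the 0-based positions of the two ends of each read pair.
    """
    n_bins = (genome_length + bin_size - 1) // bin_size  # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos_a, pos_b in pairs:
        i, j = pos_a // bin_size, pos_b // bin_size
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1  # keep the matrix symmetric
    return matrix


# Toy data: a 30-bp "genome", 10-bp bins, three read pairs
toy_pairs = [(5, 15), (12, 18), (3, 27)]
print(contact_matrix(toy_pairs, genome_length=30, bin_size=10))
# [[0, 1, 1], [1, 1, 0], [1, 0, 0]]
```

In a correctly ordered assembly, counts are highest near the diagonal and form one block per chromosome, which is the pattern inspected visually in such heatmaps.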

Gene annotation was performed using a combination of ab initio prediction, homology-based gene prediction, and transcript evidence from RNA-seq data. The de novo approach was [...] (Table 1). [...] Table S4c).

In the gene family and phylogenetic analysis, the protein-coding genes of Oryza sativa, Arabidopsis thaliana, Populus trichocarpa, Quercus variabilis, Juglans regia, Betula pendula, Ostrya japonica, and C. heterophylla were downloaded from GenBank or Ensembl databases.

The longest transcripts were selected to represent the protein-coding genes. Protein sequence clustering was performed using OrthoMCL (OrthoMCL, RRID: SCR_007839) v2.0 [59] with default parameters to identify the orthologous groups. The result showed that C. heterophylla has 16,811 orthologous groups, including 5,150 single-copy genes, 6,040 multiple-copy genes, and 582 specific genes. Notably, 222 species-specific families were identified for C. heterophylla, which might contribute to its unique features (Fig. 3A). For the phylogenetic analysis, 1,182 single-copy orthologs were identified from one-copy families of the selected species. The protein sequences of single-copy orthologs were aligned using MUSCLE [...]

To our knowledge, this is the first report of a chromosome-level genome assembly of C. [...]

The authors declare that they have no competing interests.

Note: only sequences whose length is more than 1 kb are considered.
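The note above restricts the assembly statistics to sequences longer than 1 kb. As a minimal sketch of how an N50 value such as the contig and scaffold N50 reported in the abstract is computed under that restriction, the function below uses the standard N50 definition; the function name and the toy lengths are hypothetical and the code is not the authors' script.

```python
from typing import Iterable


def n50(lengths: Iterable[int], min_length: int = 1000) -> int:
    """Return the N50 of the sequences strictly longer than min_length.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total retained assembly length.
    """
    kept = sorted((n for n in lengths if n > min_length), reverse=True)
    if not kept:
        return 0
    half_total = sum(kept) / 2
    running = 0
    for length in kept:
        running += length
        if running >= half_total:
            return length
    return kept[-1]  # not reached; kept for completeness


# Toy contig lengths in bp; the 500-bp contig is dropped by the 1-kb filter
toy_lengths = [500, 2000, 3000, 4000, 5000, 6000]
print(n50(toy_lengths))  # 5000
```

The filter matters for comparability: including very short sequences increases the total length and can lower the reported N50.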