“Mind the Gap”: Hi-C Technology Boosts Contiguity of the Globe Artichoke Genome in Low-Recombination Regions

Globe artichoke (Cynara cardunculus var. scolymus; 2n2x=34) is cropped largely in the Mediterranean region, being Italy the leading world producer; however, over time, its cultivation has spread to the Americas and China. In 2016, we released the first (v1.0) globe artichoke genome sequence (http://www.artichokegenome.unito.it/). Its assembly was generated using ∼133-fold Illumina sequencing data, covering 725 of the 1,084 Mb genome, of which 526 Mb (73%) were anchored to 17 chromosomal pseudomolecules. Based on v1.0 sequencing data, we generated a new genome assembly (v2.0), obtained from a Hi-C (Dovetail) genomic library, and which improves the scaffold N50 from 126 kb to 44.8 Mb (∼356-fold increase) and N90 from 29 kb to 17.8 Mb (∼685-fold increase). While the L90 of the v1.0 sequence included 6,123 scaffolds, the new v2.0 just 15 super-scaffolds, a number close to the haploid chromosome number of the species. The newly generated super-scaffolds were assigned to pseudomolecules using reciprocal blast procedures. The cumulative size of unplaced scaffolds in v2.0 was reduced of 165 Mb, increasing to 94% the anchored genome sequence. The marked improvement is mainly attributable to the ability of the proximity ligation-based approach to deal with both heterochromatic (e.g.: peri-centromeric) and euchromatic regions during the assembly procedure, which allowed to physically locate low recombination regions. The new high-quality reference genome enhances the taxonomic breadth of the data available for comparative plant genomics and led to a new accurate gene prediction (28,632 genes), thus promoting the map-based cloning of economically important genes.

small genomes (e.g., bacterial or viral), complete genome sequences can frequently be reconstructed computationally; however, the reconstruction of large and complex eukaryotic genomes, such as the ones of plants, continue to pose significant challenges (Ghurye and Pop 2019). Short reads technology (e.g.: Illumina) is generally combined with long-reads sequencing technologies, such as Singlemolecule real-time sequencing (SMRT, Pacific Biosciences) or nanopore sequencing (Oxford Nanopore technologies). Furthermore, with the goal of improving the assembly quality, cutting edge scaffolding technologies such as linked-reads (10X Genomics), optical mapping (Bionano Genomics) and proximity ligation methods (Hi-C, Dovetail Genomics) are adopted.
Hi-C is a proximity ligation based method, which relies on the fact that, after fixation, segments of DNA in close proximity in the nucleus are more likely ligated together and sequenced as pairs in respect to more distant regions. As a result, the number of read pairs between intra-chromosomal regions is a slowly decreasing function of the genomic distance between them. Furthermore, Hi-C could theoretically allow score contact frequency between virtually any pair of genomic loci (Lieberman-Aiden et al. 2009).
Globe artichoke harbors a highly heterozygous genetic background, which hampers the production of a reference assembly. We developed an inbred genotype with a 10% of residual heterozygosity, of which we released the first globe artichoke genome sequence (Scaglione et al. 2016). The assembly (v1.0) was generated using 133-fold Illumina sequencing data and covered 725 of the 1,084 Mb genome. Through genetic mapping, we anchored 526 Mb (73%) of the genome sequence to 17 chromosomal pseudomolecules, although 199 Mb (27%) remained unplaced. More recently, we released an improved annotation (v1.1) of the v1.0 assembly and the genome sequence of four globe artichoke genotypes (Acquadro et al. 2017), as well as a genotype of cultivated cardoon.
Here we report on a new reference genome (v2.0), obtained by sequencing a Hi-C genomic library and assembling data with previously generated sequence datasets. This new chromosome-level version is characterized by a high contiguity and reduces drastically the number of unplaced scaffolds.

MATERIALS AND METHODS
Hi-C Library preparation, sequencing and assembling Fresh etiolated leaves of a globe artichoke inbred line (2C), from which we generated the reference genome (Scaglione et al. 2016), was provided to Dovetail Genomics (https://dovetailgenomics.com). DNA was extracted from leaf samples and used to construct a Hi-C library following manufacturer protocols (Putnam et al. 2016).
The Hi-C library was then quality checked through sequencing (2M PE 75bp reads, Illumina, MiSeq) and reads mapped back to the draft assembly. Afterward, extensive Illumina sequencing was performed with an Illumina HiSeq X instrument (PE150bp reads chemistry).

Gene prediction
The new assembly was masked using RepeatMasker (Smit et al. 2013(Smit et al. -2015 using a combination of homology-based and de novo approaches. After a soft masking step, a gene prediction was performed using Maker-P (Campbell et al. 2014). Augustus (Stanke et al. 2006) Hidden Markov Models and SNAP (Bromberg and Rost 2007) gene prediction algorithms were combined with artichoke transcripts available in NCBI and proteins alignments as evidence to support prediction. All predicted gene models were filtered to maintain only those with a AED # 0.35; this value measures the concordance between the predicted model and the experimental tests, with reliability of the higher models and low AED values. For each predicted gene, the gene function was assigned by a BlastP (Altschul et al. 1990) search against the Uniprot/Swissprot Viridiplantae database (The UniProt Consortium 2014), using the default parameters, with the exception of the e-value (, 1e -5 ). The sequences of the predicted proteins were also noted using InterproScan The MIReNA (Mathelier and Carbone 2010) software was used for the identification of high confidence miRNA-coding sequences (miRBase release 21 (Kozomara and Griffiths-Jones 2011): high confidence database). An homology search was conducted with known miRNAs from an array of 13 species (plants and algae), including: Solanum lycopersicum, Solanum tuberosum, Nicotiana tabacum, Vitis vinifera, Arabidopsis thaliana, Oryza sativa, Populus trichocarpa, Medicago trunculata, Zea mays, Picea abies, Triticum aestivum, Physcomitrella patens, Chlamydomonas reinhardtii. MIReNA was run with default parameters and the maximum number of allowed mismatches between known miRNAs and putative miRNAs was set to 10.

Genome integrity and completeness
The QUAST pipeline (Mikheenko et al. 2018), which includes the BUSCO software (Simão et al. 2015), was used for the comparison among the new and the previous versions of the genome. Plant dataset (Embryophyta, odb9) was downloaded from Busco (Simão et al. 2015) and manually implemented in the QUAST pipeline. A comparison between different versions of the globe artichoke assembled genomes was conducted retrieving co-linear blocks through Last aligner (Kiełbasa et al. 2011). Only blocks with pairwise minimal identity major/equal than 99% were plotted using Circos tool (Krzywinski et al. 2009).

Data availability
Raw reads are publicly available in the NCBI sequence read archive under the bioproject: PRJNA238069. The reference assembly (v2.0) and annotation data are either available for downloading from http:// www.artichokegenome.unito.it.

RESULTS AND DISCUSSION
Sequencing, assembling and metrics We developed a new genome assembly (v2.0) using Hi-C technology, which is based on proximity ligation and massively parallel sequencing to probe the three-dimensional structure of chromosomes within the nucleus, and capture interactions by paired-end sequencing (Putnam et al. 2016;Ghurye et al. 2017). A single genomic library was sequenced using Illumina chemistry and a total of 156,683,926 pair end reads (2x150bp; 47.01 Gbp) generated. Hi-C reads were used in the assembly procedure, by adopting the existing genomic scaffolds as starting sequences (Scaglione et al. 2016), through the HiRise assembly pipeline, and enabled an accurate assembly of the globe n■ artichoke genome up to the chromosome-level (Table 1). In all 5,023 super-scaffolds were generated, with an average size of 144,578 bp. The largest 18 super-scaffolds were assigned to chromosomes using reciprocal blast procedures. The 17 pseudomolecules were reconstructed also by joining together two super-scaffolds (13,663 and 1,119) in chromosome 6.
To assess the improvement obtained in the new assembly, a first comparison was performed between the Hi-C pseudomolecules (v2.0) toward the original scaffolds of v1.0. This resulted in an improvement of the N 50 value, which increased from 126 kb to 44.8 Mb (356-fold increase) and the N 90 , which reached 17.8 Mb compared to the original v1.0 value of 29 kb (685-fold increase). The huge improvement of the HI-C assembly was also highlighted by the L 90 value, which dramatically drop down from 6,123 scaffolds in the v1.0 version to just 15 super-scaffolds, a number close to the haploid chromosome number of the species. Similar remarkable improvements were also highlighted by comparing the Hi-C superscaffolds with the anchored version of the genome (v1.0, pseudomolecules-based plus scaffolds) (Figure 1; Table 1). As an example, the N 50 value jumped from 26Mb in v1.0 to 45Mb in v2.0, while the L 90 dropped down from 1,384 of the V1.0 to 15 in the HI-C assembly. Focusing on the unanchored portion of the genome (namely Chr0), the 199 Mb of unplaced sequence in v1.0, which included 8,327 scaffolds, was decreased to less than 34 Mb (5,005 sequences), as 165 Mb (83%) were assigned to super-scaffolds. On the whole, the percentage of anchored genome increased to 94% and the chromosome size extended with a medium gain of 36% (Table 4). The highest increase was observed in chromosome 14, whose size enlarged of 14Mb (97%), in respect to the v1.0. Some chromosomes showed scattered insertion of the new anchored scaffolds (i.e.: 1, 2, 6, 9, 10, 12, 13), while in others (i.e.: 3, 4, 5, 7, 8, 11, 14, 15, 16, 17) distinct extensive regions (ranging from 2.9Mb to 29.3Mb) were anchored (Figure 2).

Genome annotation
In the genome Hi-C version, the annotation pipeline predicted 28,632 genes, a higher number than the one predicted in v1.0 (i.e.: 26,889; (Scaglione et al. 2016)), and very close to the one we recently obtained following the genome reconstruction of globe artichoke genotypes (i.e.: 28,310, v1.1) (Acquadro et al. 2017). The number of genes in unplaced scaffolds was just 557 (1,9% of the total genes), raising up the number of genes (+4,180, 17%) placed on pseudomolecules. This number (557) is by far lower than the one located on Chr0 in the two previous structural annotations: i.e., 2,994 (Scaglione et al. 2016) and 3,471 (Acquadro et al. 2017). Following Busco (Simão et al. 2015) analysis, as expected the number of represented orthologs in Hi-C assembly (92.7%) was just slightly higher compared to the previous version (91.4%), being essentially unaltered the sequences of the contigs during the assembly process (data not shown).
The InterProScan analyses highlighted about 80% of the predicted proteins with at least one IPR domain, in line with the previous v1.0 and v1.1 annotation. Among the top 20 SUPERFAMILY domains, listed in Table 2, the most abundant in all the genomes was SSF52540 (P-loop containing nucleoside triphosphate hydrolase), which is involved in several UniPathways, including chlorophyll or coenzyme A biosynthesis. The other most abundant Superfamilies were: SSF56112 (protein Kinase-like domain), which acts on signaling and regulatory processes in the eukaryotic cell, SSF52058 (Leucine-rich repeat domain, L domain-like), which is related to resistance to pathogens and SSF48371 (Armadillo-type fold), which plays a role in defense response and translation factor activity. These findings are comparable to both v1.0 and v1.1 annotations, suggesting that Hi-C had a greater effect in improving the quality of the genome sequence than its annotation.
From a search against miRBase 21 high confidence database, species-specific miRNAs were predicted. The total number of predicted non-redundant was 144 (in 253 genome regions of the reference 2C), in line with what previously reported on annotation v1.1 (143 (Acquadro et al. 2017). The identified miRNAs belong to 37 families (Table 3), slightly lower than the ones previously reported (Acquadro et al. 2017). Notwithstanding, the most highly-represented miRNA families are shared between the two annotations, which are conserved in many taxonomic groups, as already spotted in previous studies (Cuperus et al. 2011;Chávez Montes et al. 2014;Barchi et al. 2019).

Mis-assembly level and co-linearity among assemblies
The Hi-C increased of about 30% the size of anchored genome, and accordingly the majority of the newly assembled chromosomes increased their size (Table 4). In particular, chromosomes 3, 5, 8, 11, 14 and 15 expanded of at least 50% in size, compared to the v1.0. (Figure 2). The Quast (Gurevich et al. 2013) analysis highlighted that 4,727 scaffolds were mis-assembled. The mis-assemblies were grouped in 3,553 re-locations on the same pseudomolecule, 1,157 translocations and 17 inversions. Following a more in-depth analysis, the mis-assembled scaffolds corresponded to just 54.6Mb of genomic sequence, which included small size fragments (average 11.6Kb, median n■ The Hi-C and the v1.0 of the globe artichoke genome assembly were highly co-linear (pseudomolecules plus un-placed scaffold; Figure 2). The remarkable improvement in size of the Hi-C assembly is attributable to the ability of the proximity ligation-based approach to deal with heterochromatic (pericentromeric and telomeric) regions. The latter are characterized by a low recombination rate, low gene density and high TE accumulation (Nachman 2002), thus their analysis is a tough task (Zhang et al. 2014) when a classical genetic mapping approach relying on the recombination rate (Scaglione et al. 2016) is used. This is the case of v1.0. genome assembly, while the v2.0 was based on the proximity ligation technology, which is recombination rate aware. The case of chromosomes 3, 5, 8, 14 is emblematic. A clear un-aligned region ("extended gap") was present in their metacentric/sub-metacentric region in version 1.0, which in chromosomes 3 and 5 spanned up to 30Mbs. Similarly, in the terminal region of chromosomes 11 and 15, which in a previous study (Scaglione et al. 2016) appeared to be telocentric/acrocentric on the basis of their gene frequency, some scaffolds were missing in v1.0, but correctly assigned in v2.0.
All this is confirmed by the fact that the gene frequency of the newly placed scaffolds in the v2.0 assembly was just 29 genes/Mb, by far lower than the average gene frequency detected in both v1.0 and v2.0 (45 genes/Mb), and that the large newly extended regions in chr. 3,5,8,11,14 and 15 showed a furtherly reduced gene frequency (16 genes/Mb, see Figure 3).

ACKNOWLEDGMENTS
We thank Richard Michelmore (Genome Center, UC-Davis) for suggesting the use of the Hi-C technology with the goal to improve the assembly of our previously published globe artichoke genome sequence.