A Genome Assembly of the Barley ‘Transformation Reference’ Cultivar Golden Promise

Barley (Hordeum vulgare) is one of the most important crops worldwide and is also considered a research model for the large-genome, small-grain temperate cereals. Although genomic resources for barley are steadily improving, they remain limited for the cv. Golden Promise, the most efficient genotype for genetic transformation. We have developed a barley cv. Golden Promise reference assembly by integrating Illumina paired-end reads, long mate-pair reads, Dovetail Chicago in vitro proximity ligation libraries and chromosome conformation capture sequencing (Hi-C) libraries into a contiguous reference assembly. The assembled genome, comprising 7 chromosomes and 4.13 Gb, has a super-scaffold N50 after Chicago libraries of 4.14 Mb and contains only 2.2% gaps. Evaluated with BUSCO (benchmarking universal single copy orthologous genes), the genome assembly contains 95.2% of complete and single copy genes from the plant database. A high-quality Golden Promise reference assembly will be useful to the whole barley research community and will prove particularly useful for CRISPR-Cas9 experiments.

Barley is a true diploid with 14 chromosomes (2n = 14). Its genome is around 5 Gb in size and consists mainly of repetitive elements (International Barley Genome Sequencing Consortium 2012). Barley has been an important crop for thousands of years (Mascher et al. 2016). It was the fourth most produced cereal worldwide in 2016 (Faostat, http://www.fao.org/faostat/en/#home) and the second most produced in the UK. While the majority of barley is used as feed, the most important market for 2-row spring barley is the whisky industry. An iconic historical variety is the cv. Golden Promise, which was used extensively for malting and whisky production; some distilleries still use it today. Golden Promise is a 2-row spring type that was mainly grown in Scotland in the 1970s and early 1980s and was identified as a semi-dwarf mutant after gamma-ray treatment of the cultivar Maythorpe. In recent years, the main research interest in Golden Promise has come from its genetic transformability. Most barley transformations are conducted using Golden Promise, as it usually achieves the best shoot recovery from callus (Hensel et al. 2008). While many other cultivars have been tested and some used successfully, the transformation efficiency of Golden Promise is always superior (Murray et al. 2004; Ibrahim et al. 2010; Lim et al. 2018). With the rise of the CRISPR-Cas9 genome editing technology, a potential Golden Promise reference assembly has already sparked wide interest in the barley community. The use of CRISPR-Cas9 ideally requires a complete and correct reference assembly for the identification of target sites (Karkute et al. 2017). The Cas9 enzyme targets a position in the genome that matches the sgRNA (single-guide RNA) and is followed by a PAM (protospacer-adjacent motif). The guide RNA is usually designed to be 20 bp long and target-specific to avoid off-target effects. The PAM consists of the three nucleotides "NGG" (Belhaj et al. 2013; Lawrenson et al. 2015).
Any nucleotide variation between different cultivars can therefore cause problems with the CRISPR-Cas9 genome editing technology (Bortesi et al. 2016; Jaganathan et al. 2018). The time and cost involved in such increasingly common experiments highlight the value of a high-quality Golden Promise reference assembly.
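The dependence of target-site identification on the exact reference sequence can be illustrated with a minimal sketch. It scans the forward strand of a sequence for 20-nt protospacers followed directly by an "NGG" PAM. The function name and the toy sequences are hypothetical and chosen for illustration only; this is not the design procedure used by the cited studies, and real guide design additionally checks the reverse strand and genome-wide off-targets.

```python
import re


def find_cas9_targets(seq, guide_len=20):
    """Return (start, protospacer, pam) for each forward-strand site
    where a guide_len-mer is immediately followed by an NGG PAM.

    Hypothetical helper for illustration; coordinates are 0-based.
    """
    seq = seq.upper()
    # A lookahead is used so that overlapping candidate sites are all reported.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % guide_len)
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(seq)]


# Toy "reference" sequence: one valid site (protospacer + AGG PAM).
reference = "TT" + "ACGT" * 5 + "AGG" + "CC"
# Toy "cultivar" sequence carrying a single G->A change inside the PAM.
variant = "TT" + "ACGT" * 5 + "AAG" + "CC"

print(find_cas9_targets(reference))
print(find_cas9_targets(variant))
```

In this toy case the site found in the first sequence disappears in the second because a single nucleotide change removes the PAM, which is the kind of between-cultivar variation that makes a cultivar-specific reference valuable.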

Contig construction and scaffolding
DNA extraction, library construction and sequencing: High molecular weight barley DNA was isolated from leaf material of 3-week-old Golden Promise plants that had been kept in the dark for 48 hr to reduce starch levels. DNA was extracted using the GE Life Sciences Nucleon PhytoPure kit (GE Healthcare Life Sciences, Buckinghamshire, UK) according to the manufacturer's instructions. Both paired-end and long mate-pair libraries were constructed and sequenced at the Earlham Institute by the Genomics Pipelines Group. A total of 2 µg of DNA was sheared targeting 1 kbp fragments on a Covaris-S2 (Covaris, Brighton, UK), size selected on a Sage Science Blue Pippin 1.5% cassette (Sage Science, Beverly, USA) to remove DNA molecules <600 bp, and used to construct amplification-free paired-end libraries with the Kapa Biosciences Hyper Prep Kit (Roche, New Jersey, USA). Long mate-pair libraries were constructed from 9 µg of DNA according to the protocol described in Heavens et al. (2015), based on the Illumina Nextera Long Mate Pair Kit (Illumina, San Diego, USA). Sequencing was performed on Illumina HiSeq 2500 instruments with a 2x250 bp read metric, targeting >60x raw coverage of the amplification-free library and 30x coverage of a combination of long mate-pair libraries with different insert sizes >7 kbp.
Contig and scaffold generation: Contigs were assembled using the w2rap-contigger (Clavijo et al. 2017). Three mate-pair libraries were produced with insert sizes of 6.5, 8 and 9.5 kb and sequenced to generate approximately 284 million 2x250 bp reads. Mate-pair reads were processed and used to scaffold contigs as described in the w2rap pipeline (Clavijo et al. 2017; https://github.com/bioinfologics/w2rap). Scaffolds shorter than 500 bp were removed from the final assembly.
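The length filter described above, and the N50 statistic used to summarize contiguity in this report, can be sketched as follows. This is a minimal illustration under the usual definition of N50 (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length); the function names and the toy lengths are hypothetical and are not part of the w2rap pipeline.

```python
def filter_scaffolds(lengths, min_len=500):
    """Keep scaffolds of at least min_len bp (shorter ones are removed)."""
    return [n for n in lengths if n >= min_len]


def n50(lengths):
    """Return the N50 of a list of sequence lengths."""
    if not lengths:
        raise ValueError("N50 is undefined for an empty assembly")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for n in sorted(lengths, reverse=True):
        running += n
        if 2 * running >= total:
            return n


# Toy scaffold lengths in bp (not real data).
scaffolds = [100, 400, 600, 1000, 2000, 3000]
kept = filter_scaffolds(scaffolds)
print(kept, n50(kept))
```

Removing very short scaffolds lowers the sequence count with little effect on total length, so N50 is best compared between assemblies that were filtered in the same way.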

Chromosome conformation capture
Dovetail: Leaf material from 10-day-old Golden Promise plants was sent to Dovetail Genomics (Santa Cruz, CA, USA) for the construction of Chicago libraries. Dovetail extracted high molecular weight DNA and conducted the library preparations. The Chicago libraries were sequenced on an Illumina HiSeqX (Illumina, San Diego, CA, USA) with 150 bp paired-end reads. Using the scaffold assembly as input, the HiRise scaffolding pipeline was used to build super scaffolds (Putnam et al. 2016).

Hi-C: The Hi-C library was constructed from one-week-old seedlings of Golden Promise following the protocol described in Padmarasu et al. (2019), using DpnII for digestion of crosslinked chromatin. The Hi-C library was sequenced on an Illumina HiSeq 2500 (Illumina, San Diego, CA, USA) with 101 bp paired-end reads. Super scaffolds from Dovetail were ordered and oriented to build the final pseudomolecule using the TRITEX assembly pipeline (Monat et al. 2019), for which a detailed user guide is available (https://tritexassembly.bitbucket.io).

Data validation and quality control
We used BUSCO with the plant dataset (embryophyta_odb9). For gene prediction, BUSCO uses Augustus (version 3.3) (Stanke et al. 2004; König et al. 2016). For the gene-finding parameters in Augustus, we set the species to wheat and ran BUSCO in genome mode (-m geno -sp wheat).

Genome assembly
Here we report a full-length Golden Promise genome assembly, generated by integrating short-read sequencing with two chromosome conformation capture sequencing approaches. Approximately 624 million 2x250 bp paired reads were generated, providing an estimated 62.4x coverage of the genome. A total of 245,820 scaffolds were generated, comprising 4.11 Gb of sequence with an N50 of 86.6 kb. Gaps comprised only 1.6% of the scaffolds (Table 1). To generate a full chromosome-scale assembly, we utilized two different chromosome conformation capture methods. In a first step, we used Chicago Dovetail data, which are generated by in vitro proximity ligation of large DNA fragments, to increase the scaffold size and to correct misjoins from the previous scaffolding. In the next step, we integrated Hi-C data, which use the native chromatin folding, to increase the contiguity to full chromosome size. This resulted in a final assembly of 4.13 Gb in 7 chromosomes plus an extra chromosome containing the unassigned scaffolds. We have provided the reference sequence as a BLAST- and GMAP-searchable website for easy access: https://ics.hutton.ac.uk/gmapper/.
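The coverage estimate above follows from simple arithmetic on the read counts. The sketch below reproduces it under the assumption that the genome size used was the approximately 5 Gb figure quoted in the introduction; the text does not state which genome size was used, and the function name is hypothetical.

```python
def raw_coverage(read_pairs, read_len, genome_size):
    """Estimated raw coverage: total sequenced bases / genome size.

    Each read pair contributes two reads of read_len bases.
    """
    return read_pairs * 2 * read_len / genome_size


pairs = 624_000_000            # ~624 million 2x250 bp read pairs (from the text)
assumed_genome = 5_000_000_000  # ASSUMPTION: ~5 Gb barley genome size
assembly_size = 4_130_000_000   # final assembly size reported in the text

print(round(raw_coverage(pairs, 250, assumed_genome), 1))  # relative to ~5 Gb
print(round(raw_coverage(pairs, 250, assembly_size), 1))   # relative to 4.13 Gb
```

The same reads correspond to a higher nominal coverage when expressed relative to the 4.13 Gb assembly rather than the estimated genome size, so the denominator should always be stated alongside a coverage figure.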

Completeness of the assembly
We used the spectra-cn function from the K-mer Analysis Toolkit (KAT) (Mapleson et al. 2017) to check for content inclusion in the contigs and scaffolds. KAT generates a k-mer frequency distribution from the paired-end reads and identifies how many times k-mers from each part of the distribution appear in the assembly being compared. It is assumed that, with high coverage of paired-end reads, every part of the underlying genome has been sampled. Ideally, an assembly should contain all k-mers found in the reads (not including k-mers arising from sequencing errors) and no k-mers that are absent from the reads. The spectra-cn plot in Figure 1a, generated from the contigs, shows sequencing errors (k-mer multiplicity <20) in black, as these are not included in the assembly. The majority of the content appears in a single red peak, indicating sequence that appears once in the assembly. The black region under the main peak is very small, indicating that most of this content from the reads is present in the assembly. The content that appears to the right of the main peak and is present twice or three times in the assembly represents repeats.
Scaffolds generally contain more misassemblies than contigs, and this is reflected in the spectra-cn plot in Figure 1b, generated from the scaffolds. The red bar at k-mer multiplicity 0, which is not present in the contig spectra-cn plot, reflects k-mers that appear in the scaffolds but not in the reads. Approximately 7.2 million k-mers are represented in this region, less than 0.15% of the total.
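The idea behind a spectra-cn comparison can be sketched in a few lines. The code below is a toy illustration of the principle only and is not KAT's implementation: it counts canonical k-mers in the reads and in the assembly, then tabulates, for each read multiplicity, how many distinct k-mers occur 0, 1, 2, ... times in the assembly. All function names, the choice of canonical k-mers, and the tiny k are assumptions made for the example.

```python
from collections import Counter, defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def count_kmers(seqs, k):
    """Count canonical k-mers over a collection of sequences, skipping non-ACGT."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def spectra_cn(reads, assembly, k):
    """matrix[read_multiplicity][assembly_copy_number] = number of distinct k-mers."""
    read_counts = count_kmers(reads, k)
    asm_counts = count_kmers(assembly, k)
    matrix = defaultdict(Counter)
    for kmer in read_counts.keys() | asm_counts.keys():
        matrix[read_counts.get(kmer, 0)][asm_counts.get(kmer, 0)] += 1
    return matrix


# Toy data: the third read carries an "error" k-mer; "CCC" exists only in the assembly.
m = spectra_cn(["AAAC", "AAAC", "AAAG"], ["AAAC", "CCC"], k=3)
for mult in sorted(m):
    print(mult, dict(m[mult]))
```

In this toy table, a k-mer with low read multiplicity and assembly copy number 0 corresponds to the black error region of the plot, while a k-mer with read multiplicity 0 and copy number 1 corresponds to the bar of assembly-only k-mers described for the scaffolds.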

Repetitive regions
The Golden Promise reference assembly was analyzed for repetitive regions using RepeatMasker with the TREP repeat library. This identified 73.2% (2.95 Gb) of the Golden Promise assembly as transposable elements (Table 2).

Transcript annotation
For transcript annotation we transferred the latest barley annotation from MorexV2 onto the Golden Promise reference assembly.

Figure 1 Spectra-cn plots comparing k-mers from the paired-end reads to k-mers in (a) the contig assembly and (b) the scaffold assembly.
Data validation and quality control
We used two approaches to evaluate the quality of the Golden Promise assembly based on gene content. The analysis was done for each step of the assembly process. The first approach used BUSCO (Benchmarking Universal Single-Copy Orthologs, v3.0.2) (Simão et al. 2015; Waterhouse et al. 2018), which assesses the completeness of a genome by identifying conserved single-copy orthologous genes. Even at the contig stage, the assembly already contained more complete single copy genes (92.4%) than the published barley assembly MorexV1 (91.5%) (Figure 2a). Throughout the assembly process this improved to 95.2% of complete and single copy genes in the final pseudomolecule. This is very close to the recently published MorexV2 assembly with 97.2% of single copy genes. As expected, the proportion of fragmented genes decreased during the assembly process from 2.8% to only 1.1% in the pseudomolecule. The second approach used a flcDNA dataset consisting of 22,651 sequences generated from the cultivar Haruna Nijo (Sato et al. 2009; Matsumoto et al. 2011). These sequences were created from 12 different conditions and represent a good snapshot of the barley transcriptome. They can be used to identify the number of retained sequences in the Golden Promise pseudomolecule and to give an impression of the segmentation of the pseudomolecule, highlighted by cDNAs that have been split within or across chromosomes. The 22,651 flcDNAs were mapped to the Golden Promise pseudomolecule using Gmap (version 2018-03-25; Wu and Watanabe 2005) with the following parameters: a minimum identity of 98% and a minimum trimmed coverage of 95%. The results for this dataset are very similar to the BUSCO analysis. The contigs already contained 81.4% of complete and single copy genes, in comparison to 73% for the MorexV1 reference (Figure 2b). The final assembly contained 87.1% of complete and single copy genes, 14 percentage points more than the barley reference MorexV1 and around 400 genes more than MorexV2, accounting for a difference of 1.9%. Similar to the BUSCO analysis, the numbers of duplicated complete genes and of fragmented genes are lower in the Golden Promise assembly. Again, the overall comparison to MorexV2 shows very similar results, emphasizing the high quality of both barley genomes.
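How the identity and coverage thresholds translate into per-cDNA categories can be sketched as below. This is a hypothetical illustration only: the record layout, the function name, and the category rules (exactly one passing alignment counted as single copy, more than one as duplicated, none as missing or fragmented) are assumptions made for the example and are not taken from the analysis scripts behind Figure 2b.

```python
from collections import Counter


def classify_cdnas(alignments, all_ids, min_identity=98.0, min_coverage=95.0):
    """Summarize cDNAs by the number of alignments passing both thresholds.

    alignments: iterable of (cdna_id, identity_pct, coverage_pct) tuples,
                e.g. parsed from an aligner's output (layout is assumed).
    all_ids:    every cDNA identifier in the query set.
    """
    passing = Counter(
        cid for cid, identity, coverage in alignments
        if identity >= min_identity and coverage >= min_coverage
    )
    summary = Counter()
    for cid in all_ids:
        n = passing.get(cid, 0)
        if n == 1:
            summary["single_copy"] += 1
        elif n > 1:
            summary["duplicated"] += 1
        else:
            summary["missing_or_fragmented"] += 1
    return summary


# Toy records (not real data): c2 maps twice, c3 fails the coverage threshold.
records = [("c1", 99.5, 100.0), ("c2", 99.0, 97.0), ("c2", 98.4, 96.0), ("c3", 99.9, 60.0)]
summary = classify_cdnas(records, ["c1", "c2", "c3", "c4"])
print(dict(summary))
print(round(87.1 - 73.0, 1))  # difference in percentage points quoted in the text
```

Under this scheme a cDNA split across two scaffolds would fail the coverage threshold for each partial alignment, which is why the fraction of passing cDNAs rises as scaffolds are joined into pseudomolecules.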

CONCLUSION
Here, we presented a Golden Promise genome assembly that improves on the currently available barley reference MorexV1 (International Barley Genome Sequencing Consortium 2012; Mascher et al. 2017) and is near-equivalent to the recently released MorexV2 (Monat et al. 2019). Importantly, Golden Promise is a European 2-row cultivar, so the assembly expands barley genomic resources to European breeding material, in contrast to the American 6-row cultivar Morex. The importance of having another genome assembly has already been demonstrated in the analysis of the highly divergent Jekyll genes (Radchuk et al. 2019). We anticipate that it will benefit the whole barley research community and will be especially useful for groups working on CRISPR-Cas9.