Taro Genome Assembly and Linkage Map Reveal QTLs for Resistance to Taro Leaf Blight

Taro (Colocasia esculenta) is a food staple widely cultivated in the humid tropics of Asia, Africa, Pacific and the Caribbean. One of the greatest threats to taro production is Taro Leaf Blight caused by the oomycete pathogen Phytophthora colocasiae. Here we describe a de novo taro genome assembly and use it to analyze sequence data from a Taro Leaf Blight resistant mapping population. The genome was assembled from linked-read sequences (10x Genomics; ∼60x coverage) and gap-filled and scaffolded with contigs assembled from Oxford Nanopore Technology long-reads and linkage map results. The haploid assembly was 2.45 Gb total, with a maximum contig length of 38 Mb and scaffold N50 of 317,420 bp. A comparison of family-level (Araceae) genome features reveals the repeat content of taro to be 82%, >3.5x greater than in great duckweed (Spirodela polyrhiza), 23%. Both genomes recovered a similar percent of Benchmarking Universal Single-copy Orthologs, 80% and 84%, based on a 3,236 gene database for monocot plants. A greater number of nucleotide-binding leucine-rich repeat disease resistance genes were present in genomes of taro than the duckweed, ∼391 vs. ∼70 (∼182 and ∼46 complete). The mapping population data revealed 16 major linkage groups with 520 markers, and 10 quantitative trait loci (QTL) significantly associated with Taro Leaf Blight disease resistance. The genome sequence of taro enhances our understanding of resistance to TLB, and provides markers that may accelerate breeding programs. This genome project may provide a template for developing genomic resources in other understudied plant species.

genome reference and understanding the genetic basis of resistance to TLB in taro.
The geographic center of taro domestication is not clear from archeological records, but molecular analyses indicate the presence of two separate gene pools that originate from the Indo-Malayan region in Asia and Melanesia in the Pacific (Coates et al. 1988, Lebot and Aradhya 1991, Mace et al. 2006, Miyasaka et al. 2019. With a basic chromosome number of n = 14, taro occurs in diploid (2n = 28) or triploid (3n = 42) forms (Coates et al. 1988, Wang et al. 2017. Both of these forms are present in Asia, Africa, and South America, whereas triploids are generally absent from Oceania (Coates et al. 1988, Ochiai et al. 2001, Matsuda and Nawata 2002, Chaïr et al. 2016. The diploid taro genome shows a wide range in estimated size, from 2 to 4 Giga base pairs [C-value mean (Bennett andLeitch 1995, Wang et al. 2017)], and within Micronesia and Polynesia the genetic diversity of taro cultivars is relatively low Aradhya 1991, Lebot et al. 2004). Indeed, comparisons between Papua New Guinea and Pacific Island cultivars revealed several identical genotypes between these two regions, consistent with the latter having been introduced from the former (Mace et al. 2006). Although genetic resistance to TLB has been identified in taro populations from Palau, Indonesia, and Papua New Guinea, all Hawaiian landraces are susceptible to TLB and many have been lost due to this disease (Miyasaka et al. 2012). Once established, TLB also forces costly changes to cultivation practices. Given the special cultural significance of taro in many Pacific Island societies, particularly in Hawaii (Abbott 1992, Cho et al. 2007, Rao et al. 2010, preservation of lineages is of substantial cultural and economic interest. Taro is typically vegetatively propagated by removing and planting suckers or vegetative propagules from the tops, but hand-pollination can produce large numbers of highly heterozygous hybrids that can be selected for disease resistance and desirable agronomic traits. A collection of taro plants maintained by the University of Hawaii and used in breeding programs includes all remaining Hawaiian landraces and several cultivars from Nepal, Indonesia, Thailand, Melanesia, and Micronesia (Cho et al. 2007). Over the past several decades, crosses between Hawaiian and non-Hawaiian cultivars have been selected for desirable characteristics and made available for commercial planting or further breeding (Shintaku et al. 2016). One such cross ('230' x '255', see methods) produced progeny that exhibited TLB resistance, but a previous linkage map developed from 240 high quality SNPs identified using a reference-free genotyping by sequencing [GBS (Elshire et al. 2011)] approach failed to demonstrate associations between QTL and TLB resistance (Shintaku et al. 2016). Thus, a high quality reference genome for improved single nucleotide polymorphism (SNP) calling and linkage mapping can serve to identify loci associated with TLB resistance and other desired agronomic qualities, and aid marker-assisted breeding programs.
Rapid advancements in sequencing technology and computer algorithms bring de novo assembly of high quality genomes within reach for non-model taxa, even for those with large, repetitive genomes (Goodwin et al. 2016, Paajanen et al. 2019. Genome assembly with short-read shotgun sequences carries the advantage of high sequencing depth at low per-base cost relative to long-read sequencing platforms; however, short reads are unable to traverse genomic intervals that span repetitive elements, which causes assembly fragmentation. Incorporating data from mate-pair libraries can help to reduce this fragmentation (Wetzel et al. 2011), but that strategy is less effective when applied to complex genomes with long [multi-kilobase (kb)] repeat elements. Alternatively, linked-reads (barcoded short-read libraries) retain long-range information while maintaining the advantages of short reads, and can produce a complete representation of the genome with high-levels of accuracy (Zheng et al. 2016, Weisenfeld et al. 2017. This genome sequencing strategy can be implemented with low DNA input, e.g., 1 ng (Marks et al. 2019), and is cost-effective. Paajanen et al. (2019) performed a direct comparison of de novo genome assemblies for the potato species Solanum verrucosum (722 Mb genome size), and showed that a single linked-read library sequenced in a single Illumina lane (92x coverage) produced an assembly comparable in quality to an assembly reconstructed from 65 Pacific Biosystems SMRT cells (50x coverage), at a fraction of the cost ($4k vs. $25k, USA dollars). Other examples of high-quality linked-read genome assemblies for non-model species include pepper (Capsicum annuum) and the African wild dog (Lycaon pictus), and this technology been leveraged to improve genome completeness for bottlenose dolphin (Tursiops truncatus) and Atlantic herring (Clupea harengus) (Armstrong et al. 2018;Hulse-Kemp et al. 2018, Kongsstovu et al. 2019Martinez-Viaud et al. 2019).
Coupled with a need to understand the genetic basis of TLB resistance, a taro genome reference and analysis of genotypic data from a mapping population that shows resistance to TLB infection will accelerate the understanding of the genomic underpinnings of that trait. Accordingly, the objectives of this study are to: 1) generate a de novo taro genome reference assembly; 2) use GBS data from a TLBresistant mapping population to construct a linkage map and test for associations between QTL and resistance to TLB; and (3) characterize basic genome architecture of Araceae using available high-quality genome resource data.

Genome material
The Hawaiian landrace 'Moi' was selected for genome sequencing because it is a grandparent to 'Parent 230' used in the '1025' TLBresistant mapping population (see below), and because it is highly regarded for its agronomic qualities. The DNA for genome sequencing was isolated from newly emerging taro leaf tissue using a modified CTAB protocol (Healey et al. 2014). Briefly, 1 to 1.5 g of tissue was ground in 10 ml of 100 mM Tris pH 7.5, 1.5 M NaCl, 2% CTAB, and 0.3% b-mercaptoethanol, incubated for one hour at 65°, then centrifuged for 5 min at 7600 RPM to pellet cellular debris. The DNA was extracted using a standard phenol-choloroform extraction protocol with three cleaning steps, chloroform, phenol:chloroform:isoamyl alcohol (25:24:1), then chloroform, with 0.5 volume added to the supernatant in each step. After the first chloroform cleaning step the aqueous phase was treated with 5 uL RNAse A (10 mg/mL) and incubated at 42°for 15 min. All mixing was performed by gently inverting the tube for several minutes, and the aqueous phase separated from phenol/chloroform by centrifugation at 7600 RPM. The DNA was precipitated from the aqueous phase by adding 1/5 volume 5M NaCl and 1.5 volume ethanol, spooled on a heat sealed glass Pasteur pipet, and transferred to a vial containing 70% ethanol for a 10 min soak. After repeating the ethanol soak step, the spooled DNA was briefly rinsed in 100% ethanol and then allowed to air dry for several minutes. Finally, the DNA was dissolved overnight in 350 to 400 uL of TE buffer (0.2mM EDTA). The DNA was quantified with a UV spectrophotometer (Eppendorf Biophotometer), which indicated A260/A280 ratios of 1.8 to 1.9 and A260/A230 ratios between 1.9 and 2.2. The DNA concentration was determined with a Qubit fluorometric spectrophotometer and subjected to a second purification using Zymo Research Genomic Clean and Concentrator Columns using the manufacturer's instructions.

Sequencing library preparation
The linked-read library was prepared by the HudsonAlpha Institute using the 10x Genomics (Pleasanton, California) microfluidic gel bead partitioning Chromium system (Zheng et al. 2016, Weisenfeld et al. 2017, and was sequenced on two lanes of an Illumina HiSeq X short-read platform. The DNA for this library was prepared by performing a high-pass size-selection to collect DNA fragments .40 kb using a 0.75% agarose gel cassette and external U1 marker for the Blue Pippin system (Sage Science, Beverly, MA, USA) following manufacturer's protocol, followed by concentration and cleaning with Ampure beads. The long-read sequences were generated on an Oxford Nanopore Technology ("nanopore") Min-ION sequencing instrument using R9.4.1 chemistry (Goodwin et al. 2015), with sequencing conducted at the University of Hawaii at Hilo. A total of 14 FLO-MIN106R9 cells were run using standard kits and reagents and following manufacturer's protocols. Cells routinely produced 3-5 gigabases (Gb) data, but a small portion of cells stopped prematurely [after 100 megabases (Mb) of data]. These failed cells were replaced by Oxford Nanopore Technology at no cost. The DNA for the nanopore long-read sequencing was sheared using Covaris G-Tubes, with most of the DNA running above 8 kb on a 0.7% agarose gel, consistent with the target fragment size.

Genome assembly
The taro genome was assembled from linked-reads and then gapfilled and scaffolded with contiguous segments (contigs) assembled from nanopore long-reads and linkage map results. The linked-read genome was assembled with Supernova software v. 2.1.1 (Weisenfeld et al. 2017) by the HudsonAlpha Institute. The nanopore long-reads were assembled with Canu v. 1.7.1 and Albacore v. 2.0.1 (Koren et al. 2017), but those contigs provided an incomplete representation of the genome, as detailed in results, so they were used only for gap filling and scaffolding of the linked-read assembly as implemented in quickmerge (Chakraborty et al. 2016). That program replaces gaps present in the recipient assembly ("query", linked-read) with sequences from the donor assembly ("reference", nanopore), discarding donor contigs without match to the query. The resultant gap-filled, linked-read assembly is hereafter referred to as the "merged" genome. For contig alignment, quickmerge implements MUMmer (Delcher et al. 2003, Kurtz et al. 2004) programs nucmer and delta-filter, with a threshold cut-off of 95% sequence similarity. To quantify similarity between assembled contigs of linked-read and nanopore assemblies those were pairwise aligned and assessed with MUMmer programs nucmer and dnadiff (Delcher et al. 2003, Kurtz et al. 2004, with settings delta-filters 1-to-1 alignment and minimum alignment length of 10 kb. Excepting the linked-read assembly, all analyses were performed on the University of Hawaii High Performance Computing cluster (server size: 20 core, 120 gigabyte RAM). Programs were run using default parameters unless otherwise noted above.
The haploid genome (1C) size of 'Moi' was estimated to be 2.39 Gb, based on the read k-mer profile analysis conducted in Supernova (Weisenfeld et al. 2017). The data inputs for the linked-read genome included 580 million linked-reads of mean length 150 bp (174 Gb), of which 89% were in proper read pairs. The nanopore MinION cells produced a total of 12.3 million long-reads, for a total of 63.7 Gb. For descriptive purposes, sequence data from one to three MinION cells were combined into a total of 11 groups, with summary statistics provided in Supplemental Table 1. To reduce computational demand during long-read contig assembly we discarded short long-reads (, 3.5 kb length, n = 6.5 million) and outlier long-reads (. 150,00 kb length, n = 43), thus retaining 5.8 million long-reads totaling 54 Gb. Accounting for the total numbers of sequenced nucleotide bases, read coverages for the linked-and long-read assembles were 60x and 23x, respectively.
To prepare for scaffolding with the linkage map, the merged assembly was filtered for assembly artifacts by removing: 1) stretches of Ns at the beginning and ends of scaffolds, 2) scaffolds consisting entirely of Ns, 3) duplicate scaffolds identified by the MUMmer (Delcher et al. 2003, Kurtz et al. 2004) package dnadiff or flagged by NCBI's genome filtering process, and 4) contigs/scaffolds containing potential contaminants. The latter were defined as producing blastn hits against 15,180 records of complete bacterial genomes available through NCBI's FTP site (accessed October 30, 2019), meeting the following cut-offs: 90% identity, length .100 nucleotides, and with match over .10% total contig length. Finally, the merged, filtered assembly was scaffolded using the linkage map (see below) to produce a pseudochromosome-level assembly.

Genome architecture
To gain insights into genome architecture of Araceae, the genomes of taro and great duckweed were assessed for repetitive content and identification of gene domains characteristic of nucleotide-binding leucine-rich repeat (NLR) plant disease resistance genes. These genes also occur in animals, and are crucial regulators of inflammatory and innate immune responses. The repetitive content was characterized by generating de novo repeat libraries with RepeatModeler v1.0.11 Hubley 2008-2015), which finds interspersed repeats by integrating RepeatScout v1.0.05 (Price et al. 2005) and RECON v. 1.08 (Bao and Eddy 2002). Based on each assembly's custom repeat library, interspersed repeats, simple repeats, and low complexity regions were quantified using the quick search option of RepeatMasker version open-4-0-9-p2 (Smit et al. 2013(Smit et al. -2015 run with RMBlastN 2.9.0+-p1 and the RepeatMasker combined database Dfam 3 [ (Hubley et al. 2016) download date 8-25-2019]. A second assessment of repeat content was constructed from all plant repeat databases available in RepeatMasker, including species databases monocotyledons (liliopsida), arabidopsis, rice, wheat, and maize, but those analyses were discarded after proving to be less sensitive than our custom repeat libraries. Relevant to TLB-resistance, the complement of NLR sequences was quantified with NLR-Annotator v0.7-beta (Steuernagel et al. 2015) using motifs defined by Jupe et al. (Jupe et al. 2012). NLR-Annotator identifies domains characteristic of the NLR gene family, and returns sequences from the start of the first domain, but not the start to end of the entire protein sequence. The advantage to this approach is that it is unbiased by protein prediction models. The great duckweed genome was included in the NLR analysis for comparative purposes.

TLB-resistant crosses and phenotypic analysis
The TLB mapping population '1025' was created by crossing two breeding lines '230' and '255' (see also Shintaku et al. 2016). Parent '230' is a cross between Hawaiian landrace 'Moi' · TLB-resistant Palauan landrace 'Dirratengadik', and Parent '255' is a hybrid between 'Sawahn Kurasae' (TLB-resistant landrace from Indonesia) and 'RMP-08'. 'RMP-08' is itself a hybrid between the Hawaiian 'Red Moi' and the Papua New Guinea 'PH15' cultivar. More than 2500 plants representing parents and progeny from 27 crosses were screened using a modified detached leaf disk assay (Brooks 2008) to define this particular TLB-resistant mapping population (Shintaku et al. 2016), crosses were repeated, and TLB resistance in several cultivars was confirmed by field tests. The '1025' mapping population was challenged with P. colocasiae isolates 'S1' and 'S39 collected from taro fields at Pepeekeo and Panaewa, Hawaii, respectively. Pure P. colocasiae cultures were established on a base of 10% V8 media plated into a petri dish (50 ml V8, 0.624 g CaCO 3 , 450 ml water, 10g agar) overlaid with 1.5% water agar plugs cut into a circle 0.5 cm in diameter, that was in turn overlaid with agar plugs from colonized plates. After establishment, cultures were grown by incubating on 10% V8 media at 27°for 5 to 10 days to obtain zoospores. The zoospores were collected from each culture plate by flooding with 10 ml of sterile distilled water followed by incubation at 4°for 30 min. The plate was then left for 20 to 25 min to adjust to room temperature, and then swirled gently and the water pipetted off and placed in a tube to count zoospores on a hemocytometer. The spore suspension was diluted to approximately 30 zoospores per microliter.
For disease challenges, 24 mm diameter leaf blade discs were cut from the '1025' mapping population using a cork borer. The second leaf blade from each plant (where leaf blade one is the first fully matured leaf blade) was used in the assay. Leaf discs were placed on 0.9% agar with the adaxial surface exposed. Four leaf blade discs from the same genotype were assayed in separate Petri dishes. In each assay, taro variety 'Bun Long' was used as a susceptible check cultivar. Leaf discs were inoculated with 10 ml water containing 300 P. colocasiae zoospores. The lesion diameter of the four discs was standardized to that of the susceptible cultivar. As an example, a relative lesion size of 0.5 indicates that the individual has half of the lesion diameter of the susceptible check cultivar.

Genotyping materials and SNP calling
The TLB-resistant mapping population consisting of Parents '230' and '255' and 92 of their progeny were sequenced using GBS (Shintaku et al. 2016) (NCBI SRX2754312), calling SNPs with the merged genome as a reference. To improve confidence in SNP calls, we included an additional taro GBS dataset comprised of 95 taro accessions that originated from Hawaii, South Pacific, Palau, and mainland Asia (Helmkampf et al. 2018;NCBI SRX2754311). This raised the total number of GBS samples to 189. For GBS, genomic DNA was isolated from freeze-dried or fresh leaf tissue using the Macherey-Nagel Nucleospin Plant II kit or Qiagen Plant DNeasy Mini kit according to each manufacturer's protocol. The GBS libraries were digested with the restriction enzyme PstI and prepared for sequencing at the Cornell University Genomic Diversity Facility (Ithaca, NY). Each set of samples was sequenced on a single flow cell lane of an Illumina HiSeq 2500 using single-end protocols to produce reads of length 100 bp. The GBS data were demultiplexed and barcodes trimmed using software GBSX v1.1.4 (Herten et al. 2015). Across all 189 GBS samples approximately 459 million usable reads were obtained. The Illumina sequences were mapped to the reference using Bowtie2 v2.2.4 (Langmead and Salzberg 2012) with setting -very-sensitive-local. Variants were called from the output of SAMtools v1.4.1 mpileup using the bcftools v1.2 multiallelic calling model (Li et al. 2009;Danecek et al. 2016), considering only uniquely mapped reads with threshold map and base quality phred quality scores of 20. After removing non-target samples, the parent and progeny variant call format (VCF) files were preliminarily filtered by discarding SNPs within 5 base pairs of an insertion/deletion (INDEL) site, removing INDELS, and removing samples with high levels of missing data (n = 8). Next, VCF files were iteratively filtered using VCFtools v0.1.14 (Danecek et al. 2011) to retain di-allelic SNPs with a minimum sequencing depth of at least eight reads per genotype, a per-locus data density of minimum 80%, and setting the minor allele frequency (maf) threshold to 0.012. Loci identified as invariant in the parents were removed, with allowance for missing data in one but not both parents. In preparation for linkage mapping and QTL analyses, the parental genotypes were phased and progeny genotypes imputed in beagle v5.0 (download date 16May19.351; Browning and Browning 2009).

Linkage map and QTL analysis
A linkage map was constructed using GBS data from 84 individuals of the '1025' mapping population that passed quality controls. The variant calls (SNPs) were filtered for segregation distortion, redundancy, homozygous condition in parents (genetically plausible because the '230' and '255' parental stocks consisted of multiple lines of clonal propagules), and unlinked loci. We aimed for 14 linkage groups, the number of haploid chromosomes in taro, and define "major linkage group" as having .8 markers per group. The grouping of SNP markers was performed using the OneMap package (Margarido et al. 2007) in Rstudio v1.1.423 (RStudio Team 2015) with R v3.5.0 (R Core Team 2019). Markers were anchored to their respective groups based on the OneMap output, and linkage maps were constructed using the CDM functionality of software package Genetic Analysis of Clonal F1 and Double cross populations (GACD) (Zhang et al. 2015) with setting linkage phases originally unknown. The Kosambi mapping function was used to convert the recombination frequency to map distance in centimorgans (cM) (Kosambi 2016). Software suggested logarithm of the odds (LOD) scores were used to construct and assign markers to different linkage groups (LGs), with a LOD score of 5.97 used for the final linkage map. The GACD software was also used to identify regions in the taro genome that correlated with the TLB resistance in the mapping population, with the mapping algorithm Inclusive Composite Interval Mapping of Additive and Dominant QTL (ICIM -ADD). ICIM is suitable for mapping a small population size as it controls for the bias due to the Beavis effect-i.e., the overestimation of explained phenotypic variance in small populations (Xu 2003). The LOD threshold was calculated by 1000 permutation tests at a = 0.05. The QTL mode of action (i.e., selection model) was calculated using the method of Muchero (Muchero et al. 2013), where 'a' and 'd' are the additive and dominance effects respectively. The mu(ac) and mu(bd) are the phenotypic means for the heterozygous loci having alleles from the same species, and mu(ad) and mu(bc) are the phenotypic means for the heterozygous loci carrying alleles from both species. The ratio of d/a is used to assess the QTL mode of action: a d/a ratio of ,1 indicates underdominance, ratio between 0 and 1 indicates partial dominance, and a ratio of .1 indicates over-dominance (Muchero et al. 2013).
Using the final linkage map as a guide, contigs were anchored and concatenated into pseudochromosomes, orienting each contig to the linkage map position starting from marker position 0 and maintaining directionality of contigs when .1 markers per linkage group corresponded to .1 SNP loci per contig. The correspondence between markers assigned to linkage groups and SNP positions on the linkage map was further applied to assign confidence rankings using a 4 point scale. A high-confidence score of 1 was assigned when markers in a linkage group were consistently ordered with SNPS on single contigs and separated by a reasonable distance, defined as exceeding that of the median sequence length used to generate the assembly. A score of 2 was assigned to markers that corresponded to a single SNP on a single contig, and a score of 3 was assigned to markers that corresponded to clustered SNPS (defined as , median sequence length of the assembly), or were ordered discordant from the linkage group map distance. Last, loci from single contigs assigning to .1 major linkage groups were scored as 4. These contigs were included in each LG pseudochromosome, therefore a small portion of the linkage groups contain redundant sequences. Only markers of score of 1 were used to assign contig directionality during contig concatenation into pseudochromosomes.

Data availability
The pseudochromosome-level reference genome assembly has been deposited at DDBJ/ENA/GenBank under the accession WUBK00000000, BioProject PRJNA567267. The version described in this paper is version WUBK01000000. File S1 contains detailed descriptions of supplemental results (Tables, Figures, and Appendixes), and documents archived in figshare (linked-read and merged draft genomes, nanopore contigs, the vcf file used for linkage mapping and QTL analysis, the R script used to construct linkage groups, phenotypic data for the mapping population, and the de novo repeat library from RepeatModeler). The raw Illumina sequences and corresponding 10x Genomics barcode files are available through the NCBI Sequence Read Archive (SRA) SRP223785 and SRS5458621, and the Oxford Nanopore Technology long-read sequencing files (prior to self-correction) are available through the BioProject ID PRJNA567267. The previously published GBS datasets analyzed in this study were obtained from the NCBI SRA database under BioProject PRJNA381383 (Accessions: SRX2754311-12). The SNP calling and filtering pipeline is available online (Bellinger 2019).

Genome assembly characterization and quality metrics
The pseudochromosome-level taro genome is estimated to be mostly complete, with the total number of assembled bases, 2.45 Gb, slightly larger than the k-mer estimated genome size of 2.39 Gb. The assembly's largest scaffold was 38 Mb, and after excluding a large number of short contigs (, 5 kb), consists of 55,692 scaffolds representing 2.24 Gb (Table 1). The quickmerge step did little to improve genome assembly quality, decreasing fragmentation by only 2k contigs and increasing assembly size by only 119.8 Mb. That low amount of gap-filling can be explained by the majority of nanopore contigs failing to align to the linked-read assembly. While nanopore long-read sequencing resulted in 1.44 giga assembled bases (72,696 contigs), only 602 Mb (, 42%) of those assembled bases aligned to the linked-read genome assembly at the 95% identity threshold. In Table 1, the slight difference in number of assembled bases between the merged and pseudochromosome-level assemblies is because numbers are reported for the unfiltered merged genome, to maintain direct comparability with the (unfiltered) linked-read genome.
Based on the benchmarking gene set for monocots, completeness of the taro genome assembly is close to the high-quality, chromosome-level duckweed genome assembly, with overall recoveries of 80% and 84% BUSCOs (Figure 1). The proportion of recovered complete and single copy BUSCOs was nearly identical for the two species, 73% and 74%. Turning focus to the percent of complete and duplicated BUSCOs, the value of 3.4% for taro genome was only slightly higher than the value of 1.5% for the duckweed genome. As an outlier, the genome assembly of yam, sequenced using nanopore technology, showed a disproportionate number of duplicated benchmarking genes. Compared to the phylogenetically diverse group of monocots included in our analysis, taro and duckweed exhibited a higher fraction of missing or fragmented genes, with 43% of the missing or fragmented genes of taro also fragmented or missing in duckweed. Given the high-quality of the great duckweed genome, this finding suggests that members of Araceae differ from other monocots by types of gene losses. The BUSCO recovery rates for the non-Araceae monocots were similar regardless of whether their genomes were represented in the liliopsida_obd10 reference gene set (asparagus, eelgrass, orchid, and sorghum). Although results based on the eukaryota_ odb10 benchmarking gene set shows taro to have a greater BUSCO recovery rate, encompassing 88.2% genes complete (79.2% single copy; 9.0% duplicated), 5.1% fragmented, and 6.7% missing, that assessment is less comprehensive because that database contains substantially fewer genes than the benchmarking gene database for monocots.

Genome architecture
De novo repeat modeling revealed that the genome of taro is highly repetitive, with RepeatMasker screening estimating that 80% is comprised of mobile genetic elements (Table 2). Comparing the genome of taro to that of great duckweed indicated that the latter had substantially lower mobile genetic element content, 20%. The proportion of long terminal repeat (LTR) elements in taro and duckweed genomes, 36% and 9%, was proportionally similar to their unclassified repeat content, 38% and 11%. The genomic proportion of long interspersed elements (LINEs) was low across both taxa (,=1.6%), with only LINE1 types detected in the custom repeat libraries. Short interspersed element (SINEs) were detected only in the duckweed genome. The simple repeat and low complexity contents were also low in both taxa, ,= 2.3%. Cumulatively, the total repeat content of taro and duckweed genomes was 82% and 23%, respectively.
Analysis of taro with NLR-annotator identified 391 NLRs, of which 182 were categorized as complete, 109 complete but pseudogenes, and a further 80 annotated as partial only, which indicates a lack of some NLR specific domains (Supplemental Appendix 1). In contrast, the great duckweed genome was estimated to contain only 70 NLRs, of which 46 were categorized as complete, 12 partial, and 12 complete pseudogenes. Members of the NLR gene family divide into two groups, Toll and human interleukin-1 receptor (TIR) proteins (known as TIR-NB-LRR or TNLs) and the non-TIR class of Nucleotide-binding site leucine-rich repeat proteins known as CNLs (also known as CC-NBS-LRR) because some (not all) contain a coiled-coil (CC) structure in the N-terminus domain (Jupe et al. 2012). Among the taro NLRs we detected all were characterized as belonging to the non-TIR class, CNL, with only six exceptions. Those six TNLs were evenly split among categories complete, partial, and pseudogenes (one complete and one partial). The great duckweed genome contained only CNLs, with the exception of one TNL annotated as a pseudogene.

GBS data and genetic linkage mapping
The per-sample average number of GBS reads was 2.26 million, of which 1.5 million mapped to exactly one genome location (Table 3).
From the 1,519 filtered SNPs calls using those mapping data, 1,423 passed the initial test for segregation distortion, but 865 of those markers were discarded because of homozygosity in both parents (n = 107), redundancy (n = 514), deviance from Mendelian segregation ratio (n = 179) or unlinked status (n = 65) (Table 4). Accordingly, 558 (39.21% of 1,423) SNP markers were assigned to 31 linkage groups, which equates to a total length of 285 Mb and represents 11.64% of the genome (Table 5). Taro has a haploid chromosome of x = 14: in our analysis only 14 linkage groups had number of markers . 8 and there was a heavy reduction in the number of markers assigned to linkage groups 17 through linkage group 31 (Supplemental Figure  1). Thus, we included the first 16 linkage groups for linkage map development and QTL analysis. The final linkage map has a total distance of 4094.38 cM, 16 linkage groups, n = 558 markers, and covers a total of 285,477,188 nucleotides (10.83% of the genome; Figure 2 and Table 6). Concatenating contigs into pseudochromosomes based on linkage groups and marker scores (Supplemental Appendix 2), along with artifact filtering, reduced the number of contigs to 140,400, as reported in Table 1.

QTL analysis based on the genetic map
Composite interval mapping identified a total of 10 QTL associated with taro's resistance to two isolates of P. colocasiae (Table 6). Eight QTL (four per isolate) were unique, whereas one QTL mapped to the same interval (LG6, TIG00044844_224587 and TIG01551418_ 41957) in both isolates, explaining 17.52% and 9.81% of phenotypic The percent of BUSCO genes found in each genome is listed for categories single (S) or multiple copies (D), as well as fragmented (F) and missing (M). Analyses are based on the BUSCO liliopsida_odb10 dataset representing class monocot (n = 3,236 genes). See text for scientific names and NCBI genome accession identifiers.
n■ Table 1 Descriptive characteristics for three draft taro genome assemblies. The pseudochromosome-level taro assembly ("Ps_chr") was composed from a linked-read assembly ("LR") that was gap-filled using contigs assembled from nanopore MinION long-reads (merge step, "Merged"), filtered for assembly artifacts, and then concatenated into pseudochromosomes using a linkage map. Kilobase = kb # Scaffolds (% of assembled nucleotides) variation for S1 and S3 isolates respectively (Table 6, Figure 3). In terms of genetic selection models, the mode of action for this QTL differed between the two isolates: partial dominance was indicated for the S1 isolate, while underdominance was indicated the S3 isolate. Two QTL showed evidence of overdominance (d . 1), and of the remaining six QTL (with d , 1), three showed partial dominance and three showed underdominance, as indicated by positive or negative values of d in Table 6. The locations of QTL significant for resistance to TLB respective to all markers on LGs are shown in Supplemental Figure 2. Among the total complement of sequences containing NLR domains, 25 complete NLRs were located within sequences assigned to LGs and tested for QTL association with TLB resistance. Although none of the NLR domains were contained within those QTL, some were within close proximity. A cluster of three NLRs (1405_nlr_1, 1405_nlr_2, and 1405_nlr_3) occurred on LG6, near the QTL associated with resistance to both S1 and S3 P. colocasiae isolates (Figure 3). These NLRs were characterized as CNLs, and contained a pre-NB domain. The highest concentration (n = 9) of complete NLRs occurred on LG18, representing 5% of taro's genome-wide NLR complement, yet no QTL significant for association with TLB mapped to that particular LG. Among the nine partial NLRs that mapped to LGs, only one (NLR, 406_nlr_1) was proximate to a QTL, in this case mapping to LG8 in the interval between QTL qS1_DPI4-8-1 and qS1_DPI4-8-2.

DISCUSSION
We sequenced and assembled a taro genome using a linkedread sequencing strategy, with genome contiguity improved through gap filling and scaffolding using contigs assembled from Oxford Nanopore MinIon long-reads and linkage map results from a mapping population for TLB-resistance. By constructing a pseudochromosome-level reference genome, we provide a foundational resource for identifying genomic elements important for agronomic traits and resistance to disease. This draft taro genome is mostly complete, based on the total number of assembled base pairs compared to the k-mer estimated genome size and BUSCO results. Efforts for assembly improvement can focus on increasing contiguity through application of longer-range scaffolding data such as in vitro Hi-C (Dovetail) or optical mapping (BioNano Genomics).
For large, highly repetitive genomes, assembly with noisy longreads may be particularly challenging because of the requirement to disambiguate highly-similar, but non-homologous sequences. Assembling mobile repetitive elements requires sequencing through entire repeat elements that can exceed long-read lengths. For example, a survey of 50 plant genomes indicated that retrotransposons averaged 8,611 bp (Ou and Jiang 2018) in length, and can exceed .30 kb (Xu et al. 2010, Ma et al. 2019. In this study nanopore contigs were assembled from nanopore long-reads using Canu (Koren et al. 2017), which includes a read error correction step, yet sequencing depth may have been insufficient to achieve the levels of self-correction necessary to assemble homologous sequence components. Systematic error in raw nanopore sequencing data are known to lead to reduced accuracy of Canu-corrected reads, for example, at theoretical 30x coverage reads averaged 92% identity to a human genome reference after correction (Jain et al. 2018). Our longread coverage at 23x recovered 1,44 Mb of taro genome sequence (60% of the estimated genome size), but only 602 Mb (, 42%) of those assembled bases aligned to the linked-read genome assembly (at the 95% identity threshold). Instead of collecting additional long-read n■ Table 2 Repetitive content of taro (Colocasia esculenta) and great duckweed (Spirodela polyrhiza) genome assembles. Total repeat content was quantified using de novo repeat libraries constructed with RepeatModeler and screened with RepeatMasker. The percent (%) of sequence is relative to each individual assembly's total length excluding runs of NNN"s between scaffolded contigs. Short and long interspersed elements are denoted as SINEs and LINEs data, we opted to generate a new linked-reads assembly, restricting use of the nanopore contigs for gap-filling and scaffolding only. Our results show that genome contiguity was only slightly improved by that gap-filling (merge) step, based on N50 values. We note here that nanopore genome polishing with Illumina short-read data only-e.g., nancorr and MaSurCa approaches (Zimin et al. 2013, Goodwin et al. 2015-would have been a far less effective approach for genome improvement, because short reads cannot polish regions that have ambiguous mappings, as occurs for repetitive regions (Jain et al. 2018). In today's dollars the nanopore contigs were sequenced using a 12-cell kit that, with reagents, cost $12,500 USA dollars, in contrast to the 10x Genomics genome assembly that cost $5000 USA dollars, with HudsonAlpha preparing and sequencing the linkedread library and providing bioinformatic and assembly support. Which type of platform and sequencing depth is most suited to a given project is of course contingent on project needs and budget, depending on whether the genome is a means to an end or the end goal (Chakraborty et al. 2016). In our case, our immediate goal was to identify QTL associated with TLB and identify SNPs markers for rapid screening of progeny in taro breeding programs, which required high sequencing accuracy.

Genome architecture
Transposable elements (TEs) are self-replicating, mobile genomic units that are major contributors to genome diversity and size, and have key roles in chromosome structure, gene expression, and regulation (Bennetzen andWang 2014, Orozco-Arias et al. 2019). Plant genomes are known for proportionally high levels of TE repeat content, for example, up to 80% of genome sequence in corn (Zea mays) and Triticum urartu, a progenitor to wheat (Ling et al. 2018). The TEs fall into two categories, Class I retrotransposons and Class II DNA transposons, with the former typically more ubiquitous in plants. To illustrate this point, Class I retrotransposons contributed to 21-72% of total genome sequence in five monocot cereal grains and grasses (Ling et al. 2018), and 30-41% of total genome sequence in Saccharum spp. (sugarcane, a monocot) (Garsmeur et al. 2018), compared to Class II transposons, which contributed , 8% of total genome sequence in each of those taxa. Our findings are consistent with the above monocot examples: the two Araceae genomes considered here exhibit lower and upper bounds of mobile element repeat content-20% and 80%-with retrotransposons identified as the dominant type of mobile element (a similar proportion of mobile elements went uncharacterized). The spread we observed in repeat content of Araceae is expected because of the smaller genome size of great duckweed (Michael et al. 2017) and the well-described variation in genome size of taro (Wang et al. 2017), which may be attributable to repeat content in addition to variation in chromosome numbers. The NLR class of plant disease resistance genes show phylogenetic differences in NB-ARC domains, in addition to presence or absence of TIRs (TNL vs. CNL), and differ considerably in their phyletic distribution (Tarr and Alexander 2009, Yue et al. 2012, Steuernagel et al. 2015. Our findings are consistent with a low frequency of TNLs in genomes of monocotyledonous species (Cannon et al. 2002, Tarr andAlexander 2009): we found only 6 TNLs in taro, of which only two were complete, and only one pseudogene in duckweed. Thus, consistent with previous studies of monocots, CNLs represent a major n■ Table 4 Linkage mapping results for GBS data mapped to the merged taro genome reference. Markers passing the initial test for segregation distortion are listed as number of initial markers. Homozygous, redundant, and distorted SNPs were removed from the subsequent analysis. After binning of unique markers and filtering for segregation distortion, a suggested logarithm of the odds (LOD) score was calculated and used for linkage group (LG) formation  component of the repertoire of resistance genes in taro. The number of CNLs identified for taro is consistent with numbers observed in other monocots, averaging 300 to 400, and the fewer number of CNLs in duckweed is consistent with contraction of numbers of NLRs in aquatic plants (Baggs et al. 2019), in conjunction with the reduced genome size of great duckweed.

Linkage mapping and QTL analysis
Although we were not able to resolve exactly 14 linkage groups, this outcome is consistent with other studies conducted using clonally propagated heterozygous species (Nzuki et al. 2017, Soulard et al. 2017. In a study of taro, Soulard et al. (2017) (Shintaku et al. 2016). With a reference genome we were able to double the number of markers, however we were also able to recognize that a portion of our markers on the linkage map were ordered discordant from the order of SNPs on contigs, which in some cases were present in dense clusters (score 3, S Appendix 2). While stretches of high heterozygosity could be real-as expected given the hybridization history of taro-these also could be caused by mis-assembly, either the collapse of non-homologous reads originating from repetitive regions or chimeric joins, both of which compromise calling SNPs from short-read data. Alternatively, a minor contributor to presence of SNP clusters could arise from NLR proteins, concordant with overdominance (heterozygote advantage) in plant disease resistance genes (Michelmore and Meyers 1998). We found that some of the NLRs encompassed hypervariable genomic regions, for example, LG22 contained a complete NLR protein, 261850_nlr_1, which contained 4 SNPs across 2141 nucleotides. Another case involves LG3, with five SNPs contained in the partial n■ Figure 3 Genome wide representation of major linkage groups, including significant QTL and their LOD scores for S1 (A) and S3 (B) isolates of Phytophthora colocasiae exposed to the '1025' mapping population. The Y-axis indicates the logarithm of odds (LOD) values. Peaks above the threshold (dotted line) of LOD = 15.32 (S1 isolate) and LOD = 6.01 (S3 isolate) represent a QTL having significant interaction with the TLB tolerance. Linkage groups .100 centimorgans (cM) are labeled, with total length shown on y-axis. Approximate locations of nucleotide binding leucine rich proteins (NLRs) (complete, partial, and pseudogenes) are indicated by colored symbols.
NLR protein (NLR_291178), spanning 721 nucleotides. Overall, the linkage map and QTL results provide foundational resources for future work aimed at identifying functional genomic elements that underpin plant ability to resist diseases, including TLB. In total, we found 10 QTL associated with TLB resistance. Of the 10 putative QTL identified in our study, 9 were isolate-specific with only one in common between the S1 and S3 Phytophthora isolates. This result is similar to other studies of Phytophthora host interactions where isolate-specific effects were found (Leonards-Schippers et al. 1994, Johnson et al. 2012, Truong et al. 2012. This phenomenon underlines the need for stacking resistant genes, because multiple races and mating types of Phytophthora are found in Hawaii (Shrestha et al. 2014).
The spread of TLB poses a major threat to taro growing regions currently free from this devastating disease. Incorporating disease resistance is the most sustainable approach to manage TLB, as research in the Pacific indicates that management measures such as chemical control are largely ineffective (Singh et al. 2012), although in Cameroon (Africa) the application of fungicide to taro was shown to reduce impacts from disease (Tarla et al. 2014). Securing taro genetic resources for future use, along with retaining culturally important types allows for the maximum flexibility in deploying new cultivars.

CONCLUSIONS
In this study, we generated a high-quality genome assembly for taro, a root crop that is widely cultivated in tropical regions and is important for food security. Our results will inform studies of the origin, evolutionary history and breeding of this South Pacific crop. In addition, this genome may provide a framework for surveying mechanisms that underlie the formation of distinct morphological features associated with tropical tuber crops. This genome may also stimulate new genetic insights into this important tropical species. The QTLs identified from genotypes of a taro mapping population resistant to TLB can be further investigated to elucidate the genetics of that trait, with the complement of NLR sequences available as a starting point for advancing understanding of resistance system functionality. Further, the QTLs can be used to accelerate marker-assisted breeding programs. Finally, this genome project may provide a template for how to develop genomic resources in other understudied plant species.

ACKNOWLEDGMENTS
We acknowledge technical support and advanced computing resources from the University of Hawaii Information Technology Services -Cyberinfrastructure and Andrew Read for guidance with identifying and annotating NLR domains. The US Department of Agriculture, Agricultural Research Service is an equal opportunity/ affirmative action employer and all agency services are available without discrimination. Mention of commercial products and organizations in this manuscript is solely to provide specific information. It does not constitute endorsement by USDA-ARS over other products and organizations not mentioned. The authors declare no conflicts of interest, no disputes over the ownership of the data presented in this paper, and all contributions have been attributed appropriately via coauthorship or acknowledgment as appropriate to the situation. This material is based upon work supported by the National Science Foundation under Grant No. 1345247. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. Dr. Vernon Oi provided funding for materials and reagents for the nanopore genome sequencing.