Pseudo-chromosome–length genome assembly of a double haploid “Bartlett” pear (Pyrus communis L.)

Abstract Background We report an improved assembly and scaffolding of the European pear (Pyrus communis L.) genome (referred to as BartlettDHv2.0), obtained using a combination of Pacific Biosciences RSII long-read sequencing, Bionano optical mapping, chromatin interaction capture (Hi-C), and genetic mapping. The sample selected for sequencing is a double haploid derived from the same “Bartlett” reference pear that was previously sequenced. Sequencing of di-haploid plants makes assembly more tractable in highly heterozygous species such as P. communis. Findings A total of 496.9 Mb corresponding to 97% of the estimated genome size were assembled into 494 scaffolds. Hi-C data and a high-density genetic map allowed us to anchor and orient 87% of the sequence on the 17 pear chromosomes. Approximately 50% (247 Mb) of the genome consists of repetitive sequences. Gene annotation confirmed the presence of 37,445 protein-coding genes, which is 13% fewer than previously predicted. Conclusions We showed that the use of a doubled-haploid plant is an effective solution to the problems presented by high levels of heterozygosity and duplication for the generation of high-quality genome assemblies. We present a high-quality chromosome-scale assembly of the European pear Pyrus communis and demostrate its high degree of synteny with the genomes of Malus x Domestica and Pyrus x bretschneideri.


Introduction
The genomics era has revolutionized research on fruit tree species and many of these genomes have recently been sequenced, or are currently being sequenced [1][2] . Nevertheless, although the cost for sequencing genomes has dropped considerably, obtaining high quality assemblies and annotations for complex plant genomes is still challenging 3 . In addition to high numbers of repeats and transposable elements, high levels of heterozygosity complicate genome assembly for most fruit trees. Indeed, outcrossing fruit tree species often exhibit extremely high levels of heterozygosity with, for instance in apple 4 , one single nucleotide polymorphism (SNP) every 50 base pairs (bp). The traditional solution to circumvent the challenge of heterozygosity is to sequence highly inbred plant material [5][6] . However, such material may not always be available and many sequencing projects have used heterozygous samples for sequencing of economically important cultivars 7,8 .
Earlier assemblies of Asian pear (Pyrus × bretschneideri) 8 , European pear (Pyrus communis) 9 , and apple (Malus × domestica) 7 were based on heterozygous plant material, resulting in each case in erroneous and fragmented assemblies consisting of thousands of scaffolds. Both the Asian pear and apple genomes were subsequently re-assembled using different strategies to address the problem of extreme heterozygosity 2,8 . In the case of Asian pear the genome was re-assembled using a BAC by BAC strategy, combined with Illumina sequencing 8 . For apple, a double-haploid (DH) plant, derived from the same cultivar 'Golden Delicious' as the original reference genome, was sequenced 2 .
Here, we describe the assembly of the genome of the European pear (Pyrus communis) using a DH derived from the variety 'Bartlett', analogous to the strategy employed by Daccord et al. 2 in apple. The 'Bartlett.DH' developed at INRA, Angers, France 10 was chosen as it is derived from the same cultivar as employed for the previous European pear assembly, Bartlettv1.0, obtained by Roche 454 sequencing of extremely heterozygous plant material 9 . This new genome sequence (BartlettDHv2.0) was assembled by combining short read Illumina and long read PacBio sequencing, optical mapping, Hi-C, and genetic maps. The BartlettDHv2.0 genome assembly improves the European pear assembly to 17 pseudo chromosomes and will be a critical tool for contemporary genomic studies in pear, including genome-wide association studies (GWAS) and genomic selection (GS) for the benefit of pear breeding. There currently exists resequencing data from 113 pear accessions 11 representing both cultivated and wild pear species, but as yet no chromosome-scale reference genome has been published.

Genome sequencing and assembly
A total of 31.4 Gb of PacBio RSII long read sequence data was produced, comprising 3,665,270 reads with a read N50 of 14.2 kb. Reads longer than 10kb sum to 21.9 Gb. The RSII sequencing was supplemented by 123-fold coverage of Illumina (2x125bp) paired-end (PE) reads with a target insert size of 350 bp (61.5 Gb of sequence). Sequencing of two Hi-C libraries yielded 51.6 Gb of Illumina PE data as (2x125bp) reads. Kmer analysis of paired end Illumina data confirmed the homozygous nature of the 'Bartlett.DH' sample, with no heterozygosity peak visible in the 17-mer frequency distribution (Figure 1b vs. Figure 1a for Asian pear). Estimation of genome size from the 17-mer distribution provided an estimate of 528 Mb which agrees well with the 527 Mb genome size estimation made by Wu et al. 8 for Asian pear. The PacBio data therefore equates to 63-fold, long read coverage of the genome with 44 fold coverage in reads over 10kb.
The genome was assembled into 592 scaffolds totalling 496.9 Mb, or 94.0% of the expected genome size. The scaffold N50 is 6.5 Mb, which is a near 1,000-fold improvement over the Bartlettv1.0 assembly. Of these assembled scaffolds, 230 scaffolds totalling 445.1 Mb could be anchored to the 17 chromosomes of the pear genome using a combination of Hi-C data and the high-density genetic map. Thus 84.2% of the genome is anchored into 17 pseudomolecules with a further 51.8 Mb (477 smaller sequences) collected in linkage group zero (LG0). These metrics are summarized in Table 1. Searching of telomeric sequences (5'-TTTAGGG-3') enriched in the terminal parts of the pseudo-chromosomes allowed the identification of 22 telomeres. Out of the 17 pseudo-chromosomes, all but Chr14 and Chr16 had at least one of the two telomeres, and the presence of telomeric sequences could be found at both ends of seven of the pseudo-chromosomes (Chr4, Chr6, Chr8, Chr10, Chr11, Chr15 and Chr17). Additionally, SuperScaffold_290 was found to start with the 3'-CCCTAAA-5' sequence, suggesting it may be located at the terminal end of one chromosome. This hypothesis is also supported by three SNPs from this scaffold genetically mapping at the beginning of Chr02.
BUSCO analysis revealed 1,357 complete BUSCOs (94.3%, 64.0% single copy and 30.3% duplicated) with 1.9% fragmented and 3.8% missing BUSCOs. Marey maps 12 , showing the relationship between genetic and physical distance across each chromosome, demonstrate good agreement between the assembly and the Bartlett genetic map (Supplementary figures S1-S17); an example showing Chromosome 1 is provided in Figure 2. All plots have been produced using python's matplotlib v.3.0.3 library.
Using the high-density genetic map of Bartlett, a haplotype map for BartlettDHv2 was produced, allowing the identification of the two haplotypes of Bartlett in the BartlettDHv2 assembly, as well as the recombination breakpoints (Supplementary figure S1), confirming the gynogenesis origin of the BartlettDHv2 hypothesized by Bouvier et al 10 . Summary statistics of two assemblies produced using Canu 13 and Falcon 14 are shown in Table 2. The Canu assembly has higher contiguity (501 Mb in 620 scaffolds), while the Falcon assembly produces a slightly larger, but more fragmented result (515 Mb in 1,282 scaffolds). Both assemblies were used for the optical mapping data analysis and results for both the Canu and Falcon assemblies are shown in Table 3. While the total amount of sequence is similar in both cases, the Canu assembly produced fewer conflicts with the optical mapping data than Falcon (13 vs. 38), as well as much longer scaffolds (scaffold N50 of 8.1 Mb vs 3.5 Mb in Canu and Falcon, respectively). Alignment with the high-density linkage map indicated that the Canu assembly produced fewer conflicts with the genetic map than the Falcon assembly (3 vs. 8). The Canu assembly was therefore selected as the contig assembly.  Consensus was called on the assembly using PacBio WGS, Illumina WGS and Illumina RNA-Seq data. A single iteration of consensus calling using raw PacBio data was followed by polishing with Illumina WGS data. This Illumina consensus calling was performed iteratively while monitoring the number of kmers shared between the assembly and the Illumina read data. This metric reached a maximum value after seven iterations and Illumina WGS consensus calling was halted at this point. Finally, iterative consensus calling was run using RNA sequencing (RNA-Seq) data instead of the WGS Illumina data in order to focus the consensus on coding sequence. The rationale for this was that small errors are particularly a problem in coding regions because they can introduce frameshifts that severely affect the annotation of genes 15 . Metrics indicated that the consensus calling of coding regions was optimal after the second iteration. The second iteration of RNA-Seq consensus calling was therefore selected as the final scaffold assembly. Kmer comparison of the assembly with Illumina data (Supplementary Figure S37) does not suggest any significant amount of duplicated or missing content.  Combining scaffolds with proximity information from Hi-C sequencing enabled arrangement of the scaffolds into 17 ordered and oriented clusters representing the 17 chromosomes of the pear genome. Agreement of Hi-C clusters with the genetic map was not perfect but was very high, with 11 of the 17 Hi-C clusters being in perfect agreement with the genetic map. For such clusters, every anchored scaffold in that cluster is anchored to the same linkage group (LG) by the genetic map and no scaffold from another cluster was ever anchored to that LG. Comparison of the other 6 Hi-C clusters with the genetic map suggested that the Hi-C had correctly grouped and oriented chromosome arms. These clusters could be made to agree perfectly with the genetic map by splitting each of them into two. These remaining six clusters were therefore split and then re-joined in accordance with the genetic map.

Comparison of BartlettDHv2.0 assembly with Bartlettv1.0 assembly
The Bartlettv1.0 assembly totals 507.7 Mb (excluding N's), of which 99.8% (506.8 Mb of sequence in 141,034 out of the 142,083 original scaffolds) was aligned to the BartlettDHv2.0 assembly. Inter-assembly synteny is very strong, suggesting that although highly fragmented, the Bartlettv1.0 assembly was a true depiction of the genome. There is evidence of some haplotype separation in the Bartlettv1.0 assembly as 25,120 scaffolds totalling 25.6 Mb align to overlapping positions on the BartlettDHv2.0 assembly. Conversely, 1,974 scaffolds totalling 1.6 Mb, aligned to multiple places in the BartlettDHv2.0 assembly. These scaffolds represent repeats which are collapsed in the Bartlettv1.0 assembly, but not in the BartlettDHv2.0 assembly. This 1.6 Mb of repeat scaffolds from the Bartlettv1.0 assembly becomes 4.4 Mb of sequence in the BartlettDHv2.0 assembly, highlighting the importance of third generation, long read data in resolving the repetitive structures of plant genomes.

Gene annotation
The combination of ab initio gene prediction with protein alignment and cDNA alignment prediction enabled the annotation of 37,445 protein-coding genes in the BartlettDHv2.0 assembly. BUSCO analysis of predicted proteins revealed uncovering of 1,202 complete 'BUSCO' genes (83.5%, 62.7% single copy and 20.8% duplicated) with 6.2% fragmented and 10.3% missing 'BUSCO' genes. In total, 95% of these are supported by RNA-seq evidence. On average, gene models consisted of transcript lengths of 2,944 bp, coding lengths of 1,186 bp, and means of 10 exons per gene. These values are similar to those observed in Asian pear 8 , apple 2 , and the Bartlettv1.0 assembly 9 ( Table 4). All gene models had matches in at least one of the public protein databases (nrprot or interpro), while 95% of them contained domains recognised in the interpro database. The average gene density in BartlettDHv2.0 assembly is 7.1 genes per 100 kb, with genes being more abundant in subtelomeric regions, as previously observed in other sequenced plant genomes ( Figure 2, Supplementary figures S1-S17).

Orthology analysis
The predicted protein sequences from European pear were compared with those from eight other species, Pyrus x bretschneideri 8 , Malus x domestica (GDDH13) 2 , Fragaria vesca 16 , Prunus persica 17 , Rosa chinensis 18 , Rubus occidentalis 1 , Vitis vinifera 19 , and Arabidopsis thaliana 20 . Proteins were clustered into 20,677 orthologous groups (≥ 2 members), of which 8,877 (43%) were common to all nine genomes ( Figure 3). Full results of the orthology analysis are available from the pear project database on request. A set of 414 gene clusters were identified as being specific to the three pome fruits analysed (i.e. to apple and the two species of pear). A set of 611 gene clusters was identified as being shared by the two pear species but not by apple. A set of 8 gene clusters was found to be specific to the European pear, while 22 gene clusters were specific to the Asian pear and 7 gene clusters were found to be specific to apple.
Gene clusters that were determined by the orthology analysis to be pear specific, or specific to one of the three Malinae species (Asian pear, European pear and apple) were queried in the other Malinae genomes by aligning gene sequences with Genome Threader 21 . This gene sequence re-alignment revealed that, in most of these cases, gene clusters shown to be organism specific by the orthology analysis, revealed genes that were missed by the automatic annotation of the respective genome assemblies. All Asian pear and European pear specific gene clusters could be identified in one of the other Malinae genomes, while 5 gene clusters were found to be genuinely apple specific. Of the 611 pear specific gene clusters, 526 were found in the apple genome. Of the remaining 85 pear specific gene The results of this gene structure re-alignment highlight the limits of automated gene annotation and the importance of ongoing curation of gene structure annotations. An example of the importance of manual curation of gene models has recently been reported in kiwifruit, where more than 90% of the in silico predicted gene models were re-annotated compared to a previous draft version 22 . The annotation of the BartlettDHv2.0 assembly has been loaded into the online resource for community annotation of eukaryotes (ORCAE) 23 to facilitate ongoing manual curation of gene models.

Whole-genome duplication
Distributions of synonymous substitutions per synonymous site (KS) produced for the whole paranomes of P. communis, P. × bretschneideri, and M. × domestica all support the common whole-genome duplication (WGD) event shared by the Malinae. Signature WGD peaks in the KS plots for the three species can be found at almost identical KS values of ~0.16 (Figure 4 a,b,c), as expected based on previous research 2, [7][8][9] . Comparison of these WGD KS peaks with the KS peaks of ortholog distributions between pears and apple and between pears/apple and rose (Rosa chinensis) 18 suggest that the WGD occurred quite a long time after the divergence of Amygdaloideae and Rosoideae and well before the divergence of pear and apple (unless substantial substitution rate acceleration/deceleration occurred in these lineages).

Functional annotation and GO enrichment analysis
A combination of BLAST (NR prot) and interproscan searches enabled the annotation of 12,444 of the 37,445 genes (33%) with a functional description. Loading predicted transcripts into the TRAPID online annotation platform 25 enabled annotation of 24,257 (69%) genes with at least one GO term. GO enrichment analysis was performed within the TRAPID platform on gene sets of particular biological interest, i.e. pear specific gene families and pome specific gene families. No enriched GO terms were found for the pear specific gene families, while significantly enriched GO terms for the pome specific gene families are presented in the supplementary material.

Repetitive element annotation
A combination of de novo and homology based repeat annotation identified a total of 247 Mb of transposable element sequences accounting for 49.7% of the assembly. As is typical for plant genomes, the most abundant transposable elements are retrotransposons of the long terminal repeat (LTR) family, totalling 32.6% of the genome. Although widely dispersed throughout the genome, transposon-related sequences were most abundant in centromeric regions.
The recent reassembly of the apple genome 2 revealed a previously undescribed LTR element dubbed 'HODOR' (or High Copy Golden Delicious Repeat) and the expansion of this element was implicated as having a potential role in the speciation of apple and pear. This element has now been verified in the pear genome. BLAST analysis revealed 232 full length HODOR copies in the BartlettDHv2.0 genome, only 29% of the number of full length copies identified in the apple genome. Although the HODOR element has, to date, only been identified in the apple and pear genomes, this finding must be treated with a degree of caution. The apple and pear genomes have been reassembled using the latest long read technology to arrive at chromosome scale assemblies. HODOR is a 9.2kb transposable element, and as such it may simply not have been completely assembled in previous Rosaceae genomes based on short read data. Nevertheless, BLAST searches reveal no trace of the HODOR element in the recent chromosome scale reassemblies of Fragaria vesca 16 , Rosa chinensis 18 , or Rubus occidentalis 1 , all of which were developed from long read data.
Future in depth studies into the repeat content of Rosaceae genomes may reveal the point in the evolution of the Rosaceae, at which this element first emerged and how it relates to phenotypic differences among Rosaceae species.

Chromosome structure
All 17 chromosomes of the European pear genome displayed strong nucleotide level synteny with the recent chromosome scale assembly of the apple genome 2 ( Supplementary  Figures S18b-S34b). Although only a scaffold level assembly of the Asian pear is publicly available at this time, 1,913 of the 2,182 scaffolds (82%) from the Asian pear assembly can be aligned to the European pear assembly. The aligned scaffolds sum to 495 Mb or 99.5% of the Asian pear assembly. Of the 1,913 aligned scaffolds, there are 882 scaffolds totalling 403.8 Mb (or 81% of the Asian pear assembly) which align unambiguously to the 17 assembled pseudomolecules. Numerous small-scale inversions with respect to European pear are evident within Asian pear scaffolds and any of these small-scale structural differences could prove to be of biological interest. Figure 5 is included as an example of the inter-species and inter-assembly alignments showing chromosome 1. Alignments for all 17 chromosomes are given in Supplementary figures S19 to S34.
Self synteny of the genome based on colinear gene blocks reveals that the syntenic chromosome pairs for apple 7 and pear 10 (LG3 and LG11, LG5 and LG10, LG9 and LG17, and LG13 and LG16) are clearly identifiable ( Figure 6) and most collinear regions in strawberry correspond to two regions in European pear (Figure 7), as described for both apple and Asian pear 9,10 . Hence, the BartlettDHv2.0 assembly confirms that macrosyntenic chromosome structure is conserved across the Malinae.
Revision of the gene number relative to P. communis version 1 assembly.
Many scaffolds from the Bartlettv1.0 assembly align to overlapping positions on the BartlettDHv2.0 assembly. These overlapping scaffolds most likely represent assembly of both haplotypes at the same genomic locus. Over-assembly is a danger when assembling a highly heterozygous genome and such separation of the haplotypes led to over-estimation of the gene number for apple 2 7 Re-examination of apple gene predictions and removal of overlapping gene models enabled Wu et. al. 8 to arrive at a new, lower estimate of the gene number for apple. Gene annotation of the BartlettDHv2.0 assembly resulted in a lower number of predicted genes than reported for 8 the Bartlettv1.0 assembly 9 . When Bartlettv1.0 gene models were aligned to the BartlettDHv2.0 assembly and overlapping gene models were collapsed down to a single locus, only 37,997 independent gene loci were identified. Thus, the removal of overlapping genes brings the number of gene predictions for the two P. communis assembly versions much more closely in line.

Conclusions
Cost effective, high throughput, long read sequencing is democratising the effective assembly of complex genomes, particularly repeat rich plant genomes. These advances in sequencing technology have enabled the improvement, or complete re-assembly of the draft genome sequences which have been typical of non-model organisms, including those of Pyrus species. This new improved assembly of the genome of P. communis will enable step changes in the progression of genome based technologies for pear breeders, analogous to those being developed for Malus following publication of the Golden Delicious v3.0 assembly 26 . These include the ability to undertake genomic selection, and develop genetic markers based on candidate genes for traits of interest to breeders. These markers could be identified in the genome assembly following QTL mapping, or genome wide association studies. Such technologies will enable more efficient and targeted breeding of new varieties of pear with attributes that are desired by consumers and are also grower-friendly.

Breeding the doubled haploid plant from 'Bartlett'
In 1994, the European pear variety 'Bartlett' (synonymous 'Williams') was crossed as a female parent with the variety 'Passe Crassane' (male). Among the 971 seedlings obtained after sowing in the greenhouse in 1996, one showed the typical phenotype of pear haploid plants, i.e. a smaller size compared to diploid seedlings, with a slender stem and narrow, thin leaves of a pale green colour 27 . This haploid plant (referenced W65) was confirmed by flow cytometry and propagated in vitro until development was sufficient for chromosome doubling experiment which was performed in 1998 with oryzalin based on a protocol adapted from apple 28 . The doubled haploid plant W65DH (here called 'Bartlett.DH') was confirmed as homozygous by isozyme and microsatellite markers 27 and also with the recently developed 70K SNP array 33 (data not shown). 'Bartlett.DH' was grafted on rootstock 'Adams' and is kept in an experimental orchard at INRA, Angers, France.

Sample preparation and sequencing
For Illumina sequencing, genomic DNA from 'Bartlett.DH' was purified from young rolled leaves and young meristem tissue using the NucleoSpin Plant II DNA extraction kit (Macherey-Nagel GmbH, Düren, Germany), following the manufacturer's instructions. One Illumina PE library was constructed at CNAG-CRG, Barcelona, Spain, with 340bp insert size according to KAPA Library Preparation Kit with no PCR Library Amplification/Illumina series (Roche-Kapa Biosystems) protocol and sequenced on HiSeq2000 (v4) in a single lane. Illumina adaptor sequences were clipped using the Scythe (Scythe, RRID:SCR_011844) software. Illuimina sequences were trimmed at their 3' end where the average quality within a 25bp window fell below Q20 using the Sickle software. For the BioNano and PacBio single molecule real time sequencing, genomic DNA was extracted using a modified nuclei preparation method 29 identical to that used in 2 followed by an additional phenol-chloroform purification step. Thirty SMRT cells were sequenced on the Pacific Biosciences RSII platform with the P5-C3 chemistry at the Genome Center at UC Davis.

Hi-C library preparation and sequencing
The in situ Hi-C library preparation was performed according to a protocol established for rice seedlings with minor modifications 30 . The libraries were made from two biological replicates of 'Bartlett.DH'; for each replicate, 0.5 g of fixed leaves were used as the starting material. Due to the presence of large amount of cellular debris after isolation of nuclei, the nuclear pellet was divided into five parts prior to chromatin digestion with DpnII. The Hi-C libraries were sent to the Australian Genome Research Facility (Melbourne, Australia) for sequencing using one lane of 100 bp PE sequencing using a HiSeq2000 (Illumina Inc.).

BioNano Genomics genome mapping
Agarose plug embedded nuclei were Proteinase K treated for two days followed by RNAse treatment (Biorad CHEF Genomic DNA Plug Kit). DNA was recovered from agarose plugs according to IrysPrep™ Plug Lysis Long DNA Isolation guidelines (BioNano Genomics). Of the isolated DNA, 300 ng was used for subsequent DNA nicking using Nt.BspQ1 (NEB) incubating for 2 hours at 50°C. Labeling, repair and staining reactions were done according to IrysPrep™ Assay NLRS (30024D) protocol. Finally, labeled DNA molecules were analysed on a BioNano Genomics Irys instruments with optimized recipes using one Irys chip, one flowcell, 9 runs, with 270 cycles in total.
Data was collected and processed using IrisView software V 2.5 together with a XeonPhi (version v4704) accelerated cluster and special software (both BioNano Genomics, Inc.). A de novo map assembly was generated using molecules equal or bigger than 140 Kb, and containing a minimum six labels per molecule. In total, the molecules used for assembly encompassed 291 Mb equivalent space and on average 8 labels per 100Kb molecule size. For the assembly process, stringency settings for 'alignment' and 'refineAlignment' were set to 1e-8 and 1e-9 respectively. The assembly was performed by applying 4 iterations, where each iteration consisted of an extension and merging step.
Hybrid scaffolding was done using 'hybrid scaffolding_config_aggressive' settings of IrysView.

Genome assembly and scaffolding
The genome assembly workflow began with de novo assembly of contigs from the PacBio long reads using two tools, Canu (Canu, RRID:SCR_015880) version 1.5 and Falcon (Falcon, RRID:SCR_016089) version 0.5. For each assembler the most important assembly parameters were systematically varied (Supplementary Methods), as defined by the tool developers, and by consideration of assembly theory (e.g. overlap length, overlap identity for overlap layout consensus assembly). Optimal settings were selected by comparison of assembly statistics (total size assembled and contig N50) and by alignment of Illumina PE data to the assembly with bowtie2 27 (using the 'very fast' preset). For all PacBio assemblies the consensus step was performed by running Quiver (Genomic Consensus version 2.3.3) 28 (with default parameters) on raw PacBio contigs and using the full 63X of PacBio data.
Assembled contigs were further joined into scaffolds using a combination of BioNano optical mapping data, Hi-C chromatin conformation capture data, and genetic maps. The best assemblies from Canu and Falcon were independently combined with BioNano optical mapping data using the IrysView software to develop the Canu + BioNano(CB) and Falcon + BioNano (FB) assemblies, respectively. The BioNano scaffolding process identified conflicts between the assembled contigs and the optical map, indicating some degree of misassembly in both Canu and Falcon results.

Assembly Polishing
Pilon (Pilon, RRID:SCR_014731) version1.21 29 was run iteratively on the assembly, with Illumina sequence realigned to the polished assembly at each iteration and then alignments passed to Pilon to call the next consensus. Alignments for Pilon were produced using BWA mem (BWA, RRID:SCR_010910) v0.7.17 with default settings. Kmer spectrum comparisons were made using the kmer analysis toolkit (KAT) (version 2.3.4) 31 (KAT comp) at a kmer size of 32 and the metric used to assess each iteration was the number of kmers shared between the assembly and the Illumina reads. In a second consensus phase, RNA-Seq reads were aligned as single end (SE), to the genome using Hisat (version 2.1.0) 32 with default parameters. This time the effectiveness of consensus calling was assessed by analysis of full length alignments of assembled RNA-Seq transcripts. All transcripts designated as 'complete' by Evigene 33 were aligned to the genome with BLAT 31 (BLAT, RRID:SCR_011919), version 3.4, with 34 minimum match identity 90%. Alignments were filtered to retain only full length alignments (i.e. from query start to query end). Finally, the number of gaps in the alignments (query gaps + target gaps) was used as a metric with the rationale that this serves as a proxy for the number of indels in alignments of assembled mRNA sequence.

Scaffold Validation using a high-density genetic map
A high-density genetic map was developed using a 100 individual 'Old Home' × 'Bartlett' F1 population and the Axiom™ Pear 70K Genotyping Array 35 , Markers were filtered to have less than 5% missing data and fit segregation ratios of 1:1 and 1:2:1 (α = 0.01). Mapping was conducted in an iterative process using the maximum likelihood algorithm in JoinMap 5 (JoinMap, RRID:SCR_009248) 36 . After each round of mapping, a graphical genotyping approach was used to identify and fix marker order errors and regions with low marker density caused by segregation distortion. Markers that fitted segregation ratios of 2:1 and 2:3:1 (α = 0.01) were added to the dataset after a high-quality framework map was constructed to improve the low-density regions.
The Bartlett parental map produced by JoinMap included 11,474 markers. This map was used to validate and anchor the scaffolds from both the CB and FB assemblies. SNP probe sequences from the array 35 used in the construction of the genetic map were mapped to the assembly with BLAT (version 3.4) 34 . Alignments were filtered to retain only markers perfectly matched to unique loci in the assembly as well as those with a maximum of two mismatches in the second best hit. The resulting alignments were queried to identify problematic scaffolds mapped with SNP probes from different LGs. The number of scaffolds with SNP probes mapped from different LGs was used as a metric in the quality assessment of the FB and CB assemblies. After selection of the CB assembly, its scaffolds were broken at the 3 positions where SNP mapping switched from one LG to another. Each scaffold breaking was performed by dividing the scaffold at the position 500bp past the last good SNP marker.

Scaffold clustering and genome anchoring using Hi-C
Hi-C reads were aligned to the polished scaffolds in CB with Bowtie2 (Bowtie, RRID:SCR_005476), version 2.3.3.1 37 . Each of the paired reads were aligned independently then these single end alignments were subsequently merged as recommended by the LACHESIS developers. Based on the alignments, CB scaffolds were arranged into 17 ordered and oriented clusters using the LACHESIS software 38 . As an internal check, the process was completed on two different random 95% sub-samplings of the Hi-C data, as well as on the full data set. The clusters produced by all three of these LACHESIS runs were identical. LACHESIS produces groups of scaffolds which are ordered and oriented relative to each other. These scaffold groupings were compared with the genetic map and the consistency of these sources of information was assessed. The SNP probe mapping at the scaffold validation step was compared with the clusters produced by LACHESIS.

Illumina assembly
The Illumina data was also assembled on its own, using the de-Bruijn graph based assembler SOAPdenovo2 (SOAPdenovo2, RRID:SCR_014986), version 2.04. 39 This assembly was used in various ways during the course of the pear genome project (for further scaffold validation, for training the ab initio gene predictors, etc.). The Illumina data was assembled twice. The first pass contigs were screened using the Kraken 40 software and an index built from the entire RefSeq database. Reads aligning to contaminant contigs were removed and the remaining data was assembled again.

Transcriptome assembly
The 26.6 Gb 'Bartlett' RNA-Seq data (SRA accession numbers SRR1572981 to SRR1572991) was assembled de novo, using Trinity (Trinity, RRID:SCR_013048), version 2.2.0, 43 and also genome guided, using both Cufflinks (Cufflinks, RRID:SCR_014597), version 2.2.1, 44 and Trinity-GG, version 2.2.0 45 . All transcripts from these three assemblies were pooled and input into the EviGene pipeline 33 which produces a non-redundant transcript database classified into putative primary and alternative transcripts.

Gene annotation
Gene prediction was guided by the non-redundant transcriptome assembly, as well as by spliced alignments from three sources: CDS from closely related species (apple and Asian pear), proteins from these and other less related plant species (Arabidopsis, rice, tomato), and RNA-Seq read data aligned onto the genome. All assembled European pear transcripts classified as both full length and primary by EviGene were input to the ORF finder Transdecoder (version 3.0.0) 46 to give a set of predicted CDS sequences. These predicted CDS and CDS from closely related species were aligned to the genome using BLAT (version 3.4) 34 and Genome Threader (version 1.7.0) 21  Ab initio gene predictions were performed with Augustus and Eugene using models trained on the manually annotated Illumina sequence. Augustus was executed with hint files conveying information about the spliced mappings of RNA-Seq reads, assembled transcripts, CDS sequences and proteins and the repeat annotation of the genome. Similarly, these supporting hints were supplied to Eugene and the prediction was run on repeat masked sequence (with soft masking). The ab initio gene models from Augustus and Eugene were combined with the PASA gene models as well as the gene models produced by Genome Threader alignment of proteins, CDS, and assembled transcripts. The Evidence modeler software (version 1.1.1) 51 was used to combine these different gene models and evidence sources. Finally, the Evidence modeler annotation was taken and used to retrain Eugene. A final Eugene iteration using this Evidence modeler annotation as an evidence track helped to clean up the splice boundaries of some coding sequences.

KS-based paralog and ortholog age distributions
Paralog age distributions of synonymous substitutions per synonymous site (KS) were constructed as previously described 52 , except using PhyML (PhyML, RRID:SCR_014629) 53 instead of average linkage hierarchical clustering for tree construction. Briefly, to build the paranome, an all-against-all BLASTP search was performed with an E-value cutoff of 1 × 10 −10 , followed by gene family construction using the mclblastline pipeline (v10-201, micans.org/mcl 54 ). Gene families larger than 400 members were removed. Each gene family was aligned using MUSCLE (MUSCLE, RRID:SCR_011812) v3.8.31 55 , and KS estimates for all pairwise comparisons within a gene family were obtained through maximum likelihood (ML) estimation using the CODEML program 56 of the PAML (PAML, RRID:SCR_014932) package (v4.4c 57 ). Gene families were then subdivided into subfamilies for which KS estimates between members did not exceed a value of 5. To correct for the redundancy of KS values (a gene family of n members produces n(n-1)/2 pairwise KS estimates for n-1 retained duplication events), a phylogenetic tree was constructed for each subfamily using PhyML 53 under default settings. For each duplication node in the resulting phylogenetic tree, all m KS estimates between the two child clades were added to the KS distribution with a weight of 1/m (where m is the number of KS estimates for a duplication event), so that the weights of all KS estimates for a single duplication event summed to one.
KS-based ortholog age distributions were constructed by identifying one-to-one orthologs between species using InParanoid 58 with default settings, followed by KS estimation using the CODEML program as above. Coding sequences for M. × domestica and P. × bretschneideri were obtained from the apple GDDH13 genome project 59 and from the PLAZA dicot database 60 .

Collinearity and synteny
All-against-all protein alignments were also passed to the MCScanX software 62 to identify collinearity blocks. Self-collinearity of pear was plotted using the circle_plotter program bundled with MCScanX, after rebuilding the collinearity blocks with a minimum block size of 20 to reduce the noise level. Duplication depth of strawberry homologs in pear was counted with the dissect_multiple_alignment script bundled with MCScanX. DNA level synteny between P. communis, P. x bretschneideri, M. × domestica, and the two assembly versions for P. communis were all plotted using DGenies 63 with default parameters after aligning with minimap2 (version 2.16).

Data availability
Genome assembly and gene predictions have been submitted to the Genome Database for Rosaceae 64 and are freely available at the url 65 alongside tools like JBrowse and BLAST. The community annotation portal for P. communis is available for read only access at 66 . To participate in the ongoing manual annotation efforts please contact Yves Van de Peer. Supporting data, including the genome assembly, annotations, and raw sequencing information are also available via the GigaScience database, GigaDB 67 .    Inter species colinearity between F. vesca 16 and P. communis was interrogated using MCScanX and at each gene locus of the F. vesca assembly the number of P. communis -F. vesca inter species colinear blocks (duplication depth) was counted. The number of F. vesca gene homologs having each copy number in the P. communis (BartlettDHv2) assembly is then plotted. It can be seen that most gene loci from F. vesca occur twice in P. communis.