A genome sequence for the threatened whitebark pine

Abstract Whitebark pine (WBP, Pinus albicaulis) is a white pine of subalpine regions in the Western contiguous United States and Canada. WBP has become critically threatened throughout a significant part of its natural range due to mortality from the introduced fungal pathogen white pine blister rust (WPBR, Cronartium ribicola) and additional threats from mountain pine beetle (Dendroctonus ponderosae), wildfire, and maladaptation due to changing climate. Vast acreages of WBP have suffered nearly complete mortality. Genomic technologies can contribute to a faster, more cost-effective approach to the traditional practices of identifying disease-resistant, climate-adapted seed sources for restoration. With deep-coverage Illumina short reads of haploid megagametophyte tissue and Oxford Nanopore long reads of diploid needle tissue, followed by a hybrid, multistep assembly approach, we produced a final assembly containing 27.6 Gb of sequence in 92,740 contigs (N50 537,007 bp) and 34,716 scaffolds (N50 2.0 Gb). Approximately 87.2% (24.0 Gb) of total sequence was placed on the 12 WBP chromosomes. Annotation yielded 25,362 protein-coding genes, and over 77% of the genome was characterized as repeats. WBP has demonstrated the greatest variation in resistance to WPBR among the North American white pines. Candidate genes for quantitative resistance include disease resistance genes known as nucleotide-binding leucine-rich repeat receptors (NLRs). A combination of protein domain alignments and direct genome scanning was employed to fully describe the 3 subclasses of NLRs. Our high-quality reference sequence and annotation provide a marked improvement in NLR identification compared to previous assessments that leveraged de novo-assembled transcriptomes.


Introduction
Whitebark pine (Pinus albicaulis) is a 5-needle pine of subgenus Strobus, section Quinquefoliae, subsection Strobus.Sugar pine (Pinus lambertiana) is a closely related member of the same subsection whose genome was previously sequenced (Stevens et al. 2016).Whitebark pine is found in subalpine regions in the Western contiguous United States and Canada and is most often the tree-line tree species where it occurs.Whitebark pine is of significant and somewhat unique ecological importance.Its wingless seeds are harvested, dispersed, and cached by the Clark's nutcracker (Nucifraga columbiana).Thus, there is a mutualism between the tree and the bird to the extent they have coevolved (Tomback et al. 2001).In areas of joint whitebark pine, red squirrel (Tamiasciurus hudsonicus), and grizzly bear (Ursus arctos horribilis) habitat, whitebark pine seeds from cones cached in squirrel middens are an important food source for the bears (Mattson and Reinhart 1997).In addition, whitebark pine trees provide shade to the winter snowpack that helps extend the length of the annual snowmelt.
Unfortunately, for all its ecological importance to the subalpine environment, whitebark pine has become critically threatened throughout a significant part of its natural range (Tomback et al. 2001).The primary threat is mortality due to the introduced fungal pathogen white pine blister rust (WPBR) (Cronartium ribicola).Additional threats include mountain pine beetle (Dendroctonus ponderosae), wildfire, and maladaptation due to changing climate (Tomback and Achuff 2010).At some locations in the Northern Rockies and Canada, vast acreages of whitebark pine have suffered nearly complete mortality.In December 2022, after years of conservation efforts by the Whitebark Pine Ecosystem Foundation (whitebarkfound.org)and American Forests (americanforests.org), the United States Fish and Wildlife Service listed whitebark pine as a threatened species (US FWS 2022).
There is now an urgent need to conserve and restore whitebark pine throughout its natural range.This can be effectively accomplished if a very large number of WPBR-resistant and climate-adapted seed sources can be identified and if planting stock can be produced from those sources.Forest resource managers have for many years been developing such resources using phenotypically based approaches.Identifying WPBR-resistant sources involves finding putatively resistant trees in natural stands, collecting seeds from those trees, producing seedlings, and artificially inoculating seedlings with blister rust (Sniezko et al. 2008).This approach has been effective in several white pine species, notably sugar pine and western white pine (Pinus monticola); however, the discovery process is lengthy and expensive.Likewise, identifying climate-adapted sources employs longterm genetic testing in common gardens that can take decades to complete (Bower and Aitken 2008).Thus, any new technology that could speed up and reduce the cost of identifying seed sources for restoration would be highly desired.Genomic technologies offer one such solution.Just as has been done for human disease screening and for agronomically important traits in domestic crops and livestock, the specific genes underlying these traits must first be discovered.This is the long-term goal of our research.However, this discovery is profoundly enhanced by having a well-assembled and annotated reference genome sequence.To that end, in this paper, we report on the first reference genome sequence for whitebark pine.

Reference tree
An approximately 150-year-old tree was selected from the Deschutes National Forest near Bend, Oregon by a United States Department of Agriculture (USDA) Forest Service geneticist.The exact identification number and location of the tree are held in confidence to maintain its security.Scion from the tree was collected and grafted to rootstock; clones are maintained at the USDA Forest Service Dorena Genetic Resource Center in Cottage Grove, Oregon.Tissue from these clones can be obtained upon request.Cones and needle tissue were collected from the reference tree in 2006 and 2021, respectively.

DNA isolation
The protocol used to isolate the haploid megagametophyte tissue from a single fertilized whitebark pine seed was similar to previous conifer genome sequencing projects (Neale et al. 2014;Zimin et al. 2014).Haploid genomic DNA was extracted from a single megagametophyte with the Omega Biotek E.Z.N.A. SP Plant DNA Kit.The extraction followed the manufacturer's protocol with the following modifications: polyvinylpyrrolidone (0.01 g) was added to the tissue prior to lysis, and the lysis time was extended to 1.5 h.The extracted DNA was quantified on a Qubit 2.0 (42.2 ng/μL), a Nanodrop ND-1000 (A260/280: 1.83; A260/230: 2.11), and quality was evaluated on an electrophoresis gel (fragment sizes > 20,000 bp).

Illumina short read
DNA was sequenced at the DNA Technologies and Expression Analysis Core at the UC Davis Genome Center.First, DNA libraries were prepared for whole-genome shotgun sequencing with no unique molecular identifiers using 400-ng DNA and the QIAseq FX DNA Library Kit from Qiagen.Then, sequencing was conducted on 3.5 lanes of a NovaSeqS4 with Illumina 150-bp paired-ends sequencing with an approximate insertion size of 400 bp, nonoverlapping ends, and 75× coverage.See Fig. 1 for the flow chart of the sequencing and assembly steps.

Oxford Nanopore long read
For nanopore sequencing, a protocol similar to previous conifer genome sequencing projects was used (Scott et al. 2020;Neale et al. 2022).Because sequencing by the Oxford Nanopore Technologies (ONT) platform requires more DNA per run and cannot be amplified to maintain read length, needle tissue was used for DNA extraction and sample preparation.High molecular weight DNA was extracted following the protocol described in Workman et al. (2018).Briefly, tissue was ground in liquid nitrogen with a mortar and pestle for 20 min to properly disrupt tissue.This is followed by lysis in a nuclear isolation buffer (NIB) containing spermine, spermidine, triton, and β-mercaptoethanol in a 50-mL Falcon tube (Supplementary Table 1), with end-over-end rotation of the tube at 4°C for 15 min.The resulting lysed sample is filtered through a Steriflip and then centrifuged 1,900 × g for 20 min at 4°C.The supernatant was decanted, and the pellet was resuspended in 1 mL of NIB with a paintbrush.The resuspension was brought to a total volume of 15-mL NIB and centrifuged 1,900 × g for 10 min at 4°C.These steps were repeated (discard supernatant, resuspend pellet, and wash) until the supernatant was clear, usually 2-3 times.The final pellet was resuspended into 1-mL 1× HB buffer per gram of initial tissue.Nuclei can then be spun at 7,000 × g for 5 min, supernatant was removed, and pellets were snap frozen in liquid nitrogen and stored at −80°C for later DNA extraction.Extracted nuclei were then lysed and gDNA precipitated using the Circulomics Nanobind Plant Nuclei Big DNA Kit, alpha version (EXT-PLH-001).DNA was sheared to 25 kb with the Megaruptor 2, and library preparation was performed according to the ligation sequencing kit (LSK109, ONT).Then, 1 μg of purified genomic DNA was input into the ligation sequencing kit (LSK108-LSK109, ONT).Samples were sequenced on R9.4 flowcells on either the minION or PromethION and then base-called using guppy 4.011-5.0.13 depending on the time of sequencing.

Assembly
The initial contig assembly utilized both ONT and Illumina data with a hybrid approach, where the ONT reads were first corrected using the Illumina reads, and then the corrected reads were assembled.Following the strategy used in our previous work assembling loblolly pine (Zimin et al. 2014) and other conifers, the whole-genome Illumina libraries were prepared from haploid megagametophyte tissue collected from a single seed (Fig. 1).This resulted in the reduction of the effective genome size, lowered the resource requirements on the hardware, and produces a more accurate assembly overall.
The contigs were assembled with MaSuRCA v4.0.6 (Zimin et al. 2017a).MaSuRCA used the "super-reads" technique to compress high-coverage Illumina reads into low (2× to 3×) coverage of much longer super-reads by first constructing a k-mer graph from k-mers (k = 99 here) found in the Illumina reads.The k-mers become nodes in a k-mer graph, and exact overlaps of k − 1 bases between k-mers are the edges.The super-reads technique used the graph to extend each Illumina read in 5′ and 3′ directions as far as possible, as long as the extension was unambiguous.The extended read is called a super-read.Many Illumina reads extend to the same superread.The super-reads were then used to error-correct the ONT reads, essentially producing miniassemblies for each ONT read by using the ONT read as a template.This process yielded highly accurate "mega-reads," with typically one or a few mega-reads covering each ONT read.The mega-reads were then assembled with a modified version of the Flye assembler (Kolmogorov et al. 2019).
Table 1 lists the data that were used for the initial contig assembly of the whitebark pine genome along with the sizes of intermediate super-reads and mega-reads.The Flye assembler has an internal limitation of a total input sequence of 549 Gb.To stay within this limit, a subset of the longest mega-reads was used as input to the Flye assembler.The Flye assembly process was also modified.The assembler was interrupted after the initial contig (called disjointig in the Flye paper terminology) building stage to skip the initial contig consensus.This was necessary because the Flye consensus algorithm would otherwise attempt to create a >50-Tb file of alignments of mega-reads to the contigs and eventually fail on data of this size.The consensus step was not needed because the mega-reads supplied to Flye were highly accurate.After skipping the consensus, the assembly continued with the repeat resolution and scaffolding steps.This process is automated in MaSuRCA (as of v4.0.7 and higher).The new versions automatically perform the necessary steps when the detected genome size is over 10 Gb.The statistics for this initial contig assembly (v0.1) are listed in Table 2.The key metric in Tables 1 and 2 is N50, a measure of contiguity of sequencing reads or assembled contigs.It is defined as the length of the shortest sequence for which longer and equal-length sequences contain at least half of the total sequence in the read data or assembly.For assemblies, N50 is a weighted average of the contig or scaffold lengths.For long-read technologies, which generate reads with widely varying lengths, N50 is a weighted average of the read lengths.The initial assembly (v0.1) was performed with the MaSuRCA assembler.That initial assembly was followed by scaffolding with SAMBA and polishing with POLCA to yield assembly v0.9.That assembly was filtered for redundancy, scaffolded by the HiRise scaffolder, and then super-scaffolded into chromosome-sized scaffolds with the ALLMAPS software, followed by SAMBA gap closing and polishing with JASPER to yield the final assembly (v1.0).N50 contig size decreased going from v0.9 assembly to v1.0 assembly because HiRise scaffolder breaks contigs that are inconsistent with the HiC data.Consensus quality was evaluated with POLCA software.
The initial contig assembly was followed by long-read contigging/scaffolding with SAMBA scaffolder (Zimin and Salzberg 2022).The original, uncorrected ONT reads that were 10 kb or longer were used for SAMBA scaffolding.Some of these reads may have been omitted in the contig assembly because of the input size limitation of the Flye assembler.The first iteration of SAMBA was very conservative, requiring ONT reads to match for a minimum of 9 kb to the ends of 2 contigs to join them.In the second iteration, that requirement was reduced to 4 kb.The scaffolder merges contigs and computes the consensus sequence filling the gap using the sequence of multiple ONT reads spanning the gap.Therefore, the "patches" that filled the gaps may have a higher error rate.
The final step of the contig assembly was polishing the assembly with Illumina data in 2 passes using POLCA (Zimin and Salzberg 2020).The initial quality of the contigs after the SAMBA scaffolding was estimated to be 99.988% or QV39.After 2 rounds of POLCA polishing, the consensus quality was 99.999% or QV50, corresponding to an estimated error rate of 1/100,000 bases.These steps resulted in assembly v0.9 (Fig. 1), with statistics shown in Table 2.
Next, the contigs were scaffolded with OmniC reads sequenced from the needle tissue (a variant of the HiC proximity ligation technique) with the HiRise scaffolder (Putnam et al. 2016) at Dovetail Genomics (now part of Cantata Bio).After the HiRise scaffolding, redundant duplicate contigs were identified.These exist because assemblers frequently leave extra copies of repeats or extra copies of alternative haplotype sequences already represented in the contigs as short contigs in the assembly.All 20,661 "short" contigs that were shorter than 10,000 bp were aligned to the rest of the assembly with nucmer aligner.These contigs contained 96,950,513 bp of sequence with N50 of 5,302 bp.Any contig that was shorter than 10,000 bp and that aligned to an interior of another contig with >95% identity over >95% of its length was removed from the set of short contigs.The remaining 1,371 short contigs containing 6,341,804 bp were added back to the assembly.
Following scaffolding with the OmniC data and redundancy filtering, 2 linkage maps (Weiss et al. 2020;De La Torre 2023) for the closely related sugar pine genome were utilized to super-scaffold the assembly to obtain chromosome-sized scaffolds.The 2 maps had a total of 7,767 markers (mostly short sequences).Of these markers, 2,959 mapped uniquely to the whitebark pine scaffolds.ALLMAPS software (Tang et al. 2015) was utilized to produce chromosome-sized scaffolds using the alignments of markers to the scaffolds and marker positions in the map.The 2 final steps following the scaffolding were additional gap closing with the SAMBA tool (Zimin and Salzberg 2022) using the ONT reads followed by polishing with the JASPER polisher (Guo et al. 2023) that used the Illumina data (Fig. 1).This additional polishing was needed because in the places where gaps in the scaffolds were filled, consensus computed only from the ONT reads that spanned these gaps would have resulted in low-quality sequence.The statistics of this final assembly (v1.0) are listed in Table 2.

Structural annotation of the v0.9 genome
The v1.0 genome contains the same sequence as v0.9 with the only difference being that the contigs and scaffolds were rescaffolded into chromosome-sized scaffolds in v1.0, and a few scaffolds sequences were split.The v0.9 genome sequence was available earlier and the annotation was performed on that sequence given the scale of the assessment for the nucleotidebinding leucine-rich repeat receptor (NLR) classification.See Fig. 2 for the flow chart of the annotation steps.Initial assessment of the v0.9 reference genome was conducted with BUSCO v5.2.2 with the embryophyta database (odb10; Manni et al. 2021).Subsequently, repeat sequences were identified de novo with a combination of self-to-self comparisons and structural identification with RepeatModeler v2.01 (Flynn et al. 2020).The twice soft-masked genome was used as input to BRAKER v2.1.5as well as the aligned RNA-Seq reads from NCBI (Brůna et al. 2021).The set of predicted proteins was filtered with eggNOG v5.0.2 and evaluated with QUAST v5.2.0 (Gurevich et al. 2013) and BUSCO (embryophyta).In parallel, StringTie2 v2.2.1 was run using different sets of transcriptomic input (Kovaka et al. 2019).The first gene space assembly utilized only the HISAT2 aligned short reads as input to StringTie2, while the second assembly was run in hybrid mode including both the HISAT2 alignments and full-length assembled transcripts assembly aligned to the genome with Minimap2 (Li 2018).Protein-coding sequences were generated from all StringTie2 runs with Gffread v0.12.1 (Pertea and Pertea 2020) and frame-selected and filtered with Transdecoder v5.5.0 and eggNOG v5.0.2.Supplemental Transdecoder scripts were used to obtain the coordinates of the frame-selected transcripts in the context of the genome.Transcripts that did not have a corresponding genome alignment after this filtering step were removed from the final coding sequence and protein sequence files.The final proteins were evaluated with BUSCO (embryophyta), EnTAP v0.10.8, and AGAT v1.0.0 (Hart et al. 2020;Dainat et al. 2022).EnTAP was run as a reciprocal BLAST search to estimate the alignment rate at 50/50 coverage between the query sequence and target databases (NCBI's RefSeq v208 and UniProt).AGAT was employed to provide basic filtering for structural anomalies and quantify statistics regarding structural aspects of the protein-coding regions (Dainat et al. 2022).After EnTAP annotation, transcripts without a similarity search or eggNOG match were scanned for protein domains using InterProScan, and those lacking any identifiable protein domains were removed.

Annotating the v1.0 genome
Initial assessment of the v1.0 genome was conducted with BUSCO v5.4.5 with the embryophyta database.Realignment of RNA evidence against the scaffolded reference (v1.0) was provided as input to the EASEL pipeline with a filtering threshold of 0.65 to generate a set of protein-coding gene predictions (Webster et al. 2023).The final proteins were evaluated with BUSCO (embryophyta), functionally annotated with EnTAP, and summarized with AGAT (Fig. 2).
NLR identification on the v0.9 genome NLR proteins are a major family of plant disease resistance genes, which are categorized into the TNL, CNL, and RNL subfamilies based on their N-terminal domain.Three methods were utilized to generate a more complete representation of potential NLRs in whitebark pine: InterProScan, RGAugury, and NLR-Annotator.NLRs were identified from a de novo-assembled transcriptome, whole-genome scanning, and the genome annotation to provide comparison across the available genomic resources.
InterProScan v5.35-74.0 and RGAugury v1.0 identified NLRs from the protein sequences of the genome annotation through protein domain scanning (Li et al. 2016;Paysan-Lafosse et al. 2023).InterProScan was used to identify the NB-ARC, TIR, coiledcoil (CC), RPWB, and LRR domains using the Pfam, Gene3D, SUPERFAMILY, PRINTS, SMART, and CDD databases.The GFF3 file produced by InterProScan was filtered using a custom Python script to remove all entries without at least 1 NLR domain, to speed up the identification and classification steps downstream.Custom R scripts were employed to identify the NLRs and classify them into their subfamilies based on the N-terminal domain.Those with a TIR domain are TNLs, those with a CC domain are CNLs, and those with an RPW8 domain are RNLs.Subfamilies included both complete NLRs (containing N-terminal, NB-ARC, and LRR domains) and those missing just the LRR domain.Sequences without an N-terminal domain (NB-ARC only and NB-ARC-LRR) were considered unclassified.The RGAugury pipeline is quite similar, but it first implements a filtering step based on sequence similarity to the Resistance Gene Analog database before performing domain scanning with InterProScan.RGAugury was better able to identify CNL-type NLRs than InterProScan, which struggled to identify the CC N-terminal domain.
NLR-Annotator v2.0 was used to identify potential NLRs directly from the genome sequence using NLR-associated DNA motifs (Steuernagel et al. 2020).From the genome annotation, genes overlapping at least 80% of the predicted NLRs based on the NLR-Annotator boundaries were selected as potential NLRs with BEDTools v2.29 (Quinlan and Hall 2010) (Fig. 2).Custom R scripts were employed to combine NLR annotation results from the 3 methods and identify which annotations were unique to each method.To reintroduce gene models from the BRAKER annotation, gene predictions that overlapped at least 90% of the boundaries of a complete NLR (CNLs and TNLs) were retained and included in the primary genome annotation.

Sequencing
Previously developed sequencing methods to analyze other conifers (Scott et al. 2020) were used to generate a combination of short-read (Illumina) and long-read (ONT) sequencing data in whitebark pine.This fusion of technologies brings together the advantages of both approaches: leveraging long nanopore reads to span repetitive sequences commonly found in conifers producing a highly contiguous genome assembly (Fig. 1).Although the error rate of nanopore sequencing is steadily improving, it still poses challenges for the final assembly.By integrating these long reads with highly accurate, albeit shorter, Illumina reads, a more precise assembly was produced while maintaining a high level of contiguity.
First, short-read Illumina sequencing data were generated from DNA of a megagametophyte.The haploid megagametophyte DNA precludes the typical difficulties associated with diploid DNA and natural genetic variation between alleles.From this DNA,

Recovered potenƟal NLRs
v1.0: Gene predicƟon with EASEL Fig. 2. Flow chart for annotation steps.Oval rectangles present the activities (in boldface italic type) and the software tools (underlined).Protein coding annotations v0.9 and v1.0 utilized the same input RNA libraries and alignments via HiSAT2.The first version of the annotation (v0.9) relied primarily on StringTie2 to resolve transcripts and incorporated additional models from high-quality NLRs curated from an independent BRAKER2 run.The second version of the annotation (v1.0) was conducted with EASEL that integrates direct evidence-based evaluations and high-quality ab initio predictions.Both annotations were functionally annotated with EnTAP and evaluated with benchmarks generated by BUSCO and AGAT.
Genome sequence for whitebark pine | 5 ∼2.5 Tb of sequence was generated for an estimated ∼100× coverage (Table 1).
It has previously been found that short-read sequencing, especially in conifers, results in low contiguity as the highly repetitive areas typical to these genomes are impossible to assemble with short reads alone.Complicating this issue, existing long-read sequencing methods require relatively large amounts of DNA and achieving the long-read length precludes the use of PCR.As an alternative, high-molecular-weight genomic DNA from needle tissue was extracted from the same tree (Workman et al. 2018).Using a combination of cryogenic tissue grinding and nuclei extraction, high-quality DNA was obtained, which was then subjected to either long-read (N50 20 kb, 571 Gb, ∼23×) or ultralong-read (N50 42.8 kb, 322 Gb, 13×) nanopore sequencing.

Assembly
The MaSuRCA assembler transformed the Illumina reads into super-reads (see Methods and Fig. 1).Table 1 shows that the super-read transformation turned over 18 billion 151-bp Illumina reads into about 163 million super-reads.Half of the sequence in the super-reads was in sequences of 1,228 bp or longer.MaSuRCA then used super-reads to correct the ONT reads by building miniassemblies of overlapping super-reads for each ONT read.These miniassemblies are produced using the ONT reads as templates, and they are called mega-reads.Mega-reads are long and they have a very low error rate, less than 0.5%.The mega-reads algorithm resulted in producing about 24 million mega-reads with an N50 size of 23,140 bp.The MaSuRCA assembly (v0.1) (Table 2) was followed with scaffolding with SAMBA and polishing with POLCA, resulting in assembly v0.9 (Table 2).The v0.9 assembly was then scaffolded with HiRise with OmniC data and super-scaffolded with ALLMAPS using the alignments of markers to the scaffolds and marker positions from the sugar pine map. Figure 3 shows the alignment of the markers from the sugar pine maps to the whitebark pine super-scaffolds.Some discrepancies between the scaffolds and the map were observed in chromosomes 1, 2, 3, 5, 6, and 11.These discrepancies could be due to interchromosomal rearrangements between the sugar pine and whitebark pine genomes.However, they could also be due to misassemblies in the scaffolds of whitebark pine, which cannot be resolved with the currently available data.Scaffolding with ALLMAPS resulted in 24,069,114,767 bp of sequence anchored to the chromosomes of which 23,671,235,725 bp was also oriented.Additional gap closing was then applied to the scaffolds with the SAMBA tool that used original uncorrected ONT reads to fill gaps in the scaffolds.SAMBA closed 1,484 gaps in the assembly, adding 9,065,412 bp of sequence to the assembly.Finally, the JASPER tool was applied to polish the assembly with the Illumina reads.The final polished assembly (v1.0) (Table 2) has an error rate of less than 1 error in 100,000 bases, and it contains 27,605,955,854 bp of sequences in 34,176 scaffolds with N50 contig size of 537,007 bp.Approximately 87.2% (24,072,309,274 bp) of the total sequence was placed on the 12 chromosomes.

Identifying and masking repetitive regions
Prior to the alignment of the transcriptomic short reads, repeat identification with RepeatModeler generated a custom library of 2,576 unique repeat sequences, of which 558 could be classified.This repeat library was used with RepeatMasker to softmask 77.6% of the genome sequence (Fig. 2; Table 3 and Supplementary  Table 4).The overall repetitive content was comparable to Pinus taeda at 74% and P. lambertiana at 79% (Stevens et al. 2016).The majority of the repetitive elements were LTRs, which comprised almost 42% of the genome, and roughly 32% of the genome was unclassified repetitive sequences (Supplementary Table 4).The high proportion of unclassified elements is likely due to RepeatModeler being unable to classify many of the repeats in the generated custom repeat library that was used to mask the genome, as LTRs comprised only 55% of the repeat content in whitebark pine where they usually contribute around 70% of the TE content in conifers (De La Torre et al. 2014;Stevens et al. 2016;Fujino et al. 2023).

RNA sequence data for annotation and transcriptome assembly
A total of 12 Illumina RNA-seq libraries (Supplementary Table 2) were mapped to the whitebark pine reference v0.9 genome following quality control.The final set of selected Illumina libraries ranged from 23.9 to 66.9 M reads and aligned well to the reference (94.8-96%).These alignments were used with BRAKER and StringTie2 to generate the draft genome assembly (Kovaka et al. 2019;Brůna et al. 2021).Two RNA-seq libraries (SRR13823648 and SRR13823649) generated from megagametophyte tissue were used for a de novo transcriptome assembly that was utilized for NLR annotation (Supplementary Table 2).This transcriptome assembly consisted of 37,586 transcripts and had a BUSCO (embryophyta) completeness of 92.6% (S: 88.8%; D: 3.8%).
To improve upon challenges associated with short-read alignment against the complex and repetitive conifer genome, the de novo-assembled transcripts resulting from an independent transcriptomic sampling were aligned at 71% to the genome with Minimap2 (Supplementary Table 3).These transcripts were then used as long-read input for a hybrid long-and short-read transcriptome assembly using StringTie2.The hybrid run of StringTie2 generated a total of 62,936 transcripts (BUSCO: C: 71.4% [S: 37.4%; D: 34.0%]; N50 1,807 bp).These gene models were filtered via Transdecoder/eggNOG, resulting in a total of 45,380 transcripts (BUSCO: C: 70.4% [S: 45.7%; D: 24.7%]; N50 1,515 bp; longest intron: 1.02 Mb; Supplementary Table 5).
As an additional metric for completeness, the de novo-assembled transcripts were aligned to the reference genome independently, resulting in a total of 66,233 unique alignments (BUSCO: C: 88.50% [S: 46.60%; D: 41.90%]; N50 2,217 bp; Supplementary Table 5).These alignments represent variation and gaps and do not directly translate to viable protein-coding models but can provide a benchmark for completeness.

Filtering the v0.9 genome annotation
The Transdecoder/eggNOG-filtered StringTie2 short-read predictions were selected as the best overall annotation.This annotation was further refined by removing transcripts without an EnTAP similarity search or eggNOG annotation that also lacked any protein domains identified using InterProScan, reducing the annotation by 683 genes.An additional 27 complete NLR genes identified from the genome using NLR-Annotator, and overlapping a gene model generated by BRAKER, were added to the annotation (Supplementary Table 6).This final set consisted of 27,010 genes and represented a total of 47,911 transcripts.The annotation had a BUSCO completeness of 70.6% (S: 43.2%; D: 27.4%) and an EnTAP similarity search annotation rate of 86.5%, and the longest intron recorded was 1.39 Mb in length (Table 3).The annotated gene space of whitebark pine is larger and more representative than those of P. lambertiana and P. taeda, which contained 13,936 and 9,024 high-confidence genes with BUSCO completeness of 53 and 30%, respectively (Stevens et al. 2016).The genome annotations of the spruce (Picea) species range from 35 to 49% completeness (Gagalova et al. 2022).More recent conifer genome assemblies report higher BUSCO completeness, such as Sequoia sempervirens at 65.5% completeness (Neale et al. 2022), Pinus tabuliformis at 84% (Niu et al. 2022), and Cryptomeria japonica at 91.4% (Fujino et al. 2023).
Compared to the 70.6% BUSCO completeness of the genome annotation, at the genome level, the whitebark pine genome accounted for only 55.3% BUSCO completeness using the embryophyta lineage, likely due to challenges associated with the predictions across long introns as well as the abundant pseudogenes and high repeat content (Table 3).This result is typical of conifer genomes and, despite this, the whitebark pine genome BUSCO completeness was slightly higher compared to that of several other recently assembled conifer genomes (sugar pine, spruce, coast redwood, and Chinese pine; Stevens et al. 2016;Gagalova et al. 2022;Neale et al. 2022;Niu et al. 2022, respectively).

Annotation of the v1.0 genome
A total of 27,555 genes and 58,831 transcripts were identified using the EASEL pipeline (Webster et al. 2023).The BUSCO completeness was slightly improved at 73.9% (S: 21.7%; D: 52.2%); the ratio of monoexonic to multiexonic genes was 0.33; and the longest intron was 2.45 Mb in length, nearly double the longest intron identified in the v0.9 annotation (Table 3).

NLR identification on the v0.9 genome
NLRs are a major class of disease resistance genes that recognize specific virulence factors.They have a characteristic domain structure with one of 3 canonical N-terminal domains, a nucleotidebinding domain, and a leucine-rich repeat domain.NLRs can be divided into subfamilies based on their N-terminal domain; TNLs contain a TIR domain, CNLs contain a CC domain, and RNLs contain an RPW8 domain (Van Ghelder et al. 2019).The combination of the 3 software methods used to identify NLRs from v0.9 of the genome, before scaffolding (InterProScan, RGAugury, and NLR-Annotator), was necessary to fully describe all 3 types of NLRs, as can be seen from the overlap between complete NLRs identified from the genome annotation by each method or lack thereof (Fig. 4a).The 3 methods were able to independently identify the majority of TNL-type NLRs.Thirty-four TNLs were identified by all 3 methods, 11 were identified by domain scanning methods only, 11 were unique to NLR-Annotator, and 1 was unique to InterProScan.Here, the domain scanning methods performed equally well, and the results of NLR-Annotator were a useful addition to the set of complete NLRs.For identifying RNLs, InterProScan was necessary as the other 2 programs cannot identify the RPW8 domain and would otherwise identify these as NLRs missing an N-terminal domain.For CNLs, RGAugury was necessary as NLR-Annotator and InterProScan were not as effective in identifying CNLs; 7 complete CNLs were identified using RGAugury only.Integration of the gene models from BRAKER that overlapped with complete NLRs from genome scanning was the greatest contributor to the CNL subfamily, contributing 21 complete CNLs out of a total of 33.The ratio of TNLs, CNLs, and RNLs is as expected in conifers, with TNLs being the largest class followed by CNLs, and RNLs being the smallest class (Van Ghelder et al. 2019).RNLs are also more abundant in conifers and some members of the Rosaceae compared to most land plants, which typically have 10 or fewer RNLs (Van Ghelder et al. 2019).Without recovering complete CNL gene models from the ab initio BRAKER gene predictions of whitebark pine genes, there would have been double the number of complete RNLs identified compared to the number of CNLs.Based on NLR identifications in other conifers, the number of CNLs is generally 2 to 3 times as many as the number of RNLs (Van Ghelder et al. 2019).In the giant sequoia (Sequoiadendron giganteum) genome, there were 53 complete CNLs and 17 complete RNLs identified (Scott et al. 2020), following the expected ratio.In

Conclusion
This paper reports the first important step in developing genomic technologies that can be employed to more efficiently and rapidly identify genetic resources that can be used in the restoration of the threatened whitebark pine: a well-assembled and annotated reference genome sequence.The core research team of this project has previously generated reference genome sequences for 5 other conifer species (P.taeda, Neale et al. 2014;Zimin et al. 2014;Zimin et al. 2017b;P. lambertiana, Stevens et al. 2016;Crepeau et al. 2017;Pseudotsuga menziesii, Neale et al. 2017;S. giganteum, Scott et al. 2020;and S. sempervirens, Neale et al. 2022).In all these cases, DNA from a single tree was used to generate the reference genome sequence.Table 4 presents the quantitative statistics of these 5 conifer genomes along with those of 4 other recently published large conifer genomes.Comparison of genome size, gene number, and genome annotations among these genomes with that from whitebark pine reflects very strong similarity in gene and repetitive DNA content.However, at the phenotypic level, these conifers are quite different from each other in many ways (anatomy, morphology, life history, reproductive traits, adaptative traits, disease and insect susceptibility/resistance, etc.).These large phenotypic differences must be due in a large part to allelic variation among a common set of genes and the expression of these genes.Thus, research must now begin in whitebark pine to discover population-level allelic variation and variation in gene expression.As our team has done for our other conifer genome projects, we will now embark on genome-wide association studies and environmental association studies to discover natural variation and investigate its relationship with the vast amount of phenotypic and adaptive variation in populations of whitebark pine.Discovery from studies of this nature will lead to the development of applied genomic screening tools to be used in restoration programs.

Fig. 1 .
Fig. 1.Flow chart for sequencing and assembly steps for the whitebark pine genome.The center row presents the sequence of activities (in boldface italic type) and software tools (underlined).The top and bottom rows describe the starting tissues, sequencing platforms, and sequence read and linkage map inputs and the thin arrows indicate where in the assembly process these inputs entered.The intermediate whitebark pine assembly (v0.9) emerges at the second step in the middle row, while the final assembly (v1.0) emerges at the end step of the middle row.WBP, whitebark pine.Photo credits: Sugar pine inset photograph by Mitch Barre via Wikimedia under Creative Commons Atribution-Share Alike 2.0 Generic license; Whitebark pine needles inset photograph by co-author Patrick McGuire.

Fig. 3 .
Fig. 3. Alignment of the sugar pine linkage map markers to the whitebark pine super-scaffolds.The individual chromosome plots are produced by the ALLMAPS software.The vertical bars in the middle of each of the 12 panels represent the chromosomes.The individual scaffolds of a chromosome are indicated in white or gray shading within those vertical bars.The 2 linkage maps are shown alongside each chromosome representation with marker alignments indicated with fine lines from the central chromosome representation to the linkage maps.

Fig. 4 .
Fig. 4. Results of NLR annotation methods.a) Within the genome annotation, complete NLRs identified by each method and annotations with support from multiple methods.In each cluster, the upper-left circle (yellow) represents NLRs identified only using InterProScan; the upper-right circle (coral/red) represents NLRs identified using only RGAugury; and the lower circle (pink) represents NLRs identified using only NLR-Annotator and supported by the genome annotation.b) NLRs identified by input type: a de novo-assembled transcriptome, the genome sequence, and the genome annotation.In each bar, the top rectangle represents the number of complete NLRs; the second-from-the-top rectangle represents the number of NLRs missing an LRR domain; the second-from-the-bottom rectangle represents the number of NLRs missing an N-terminal domain; and the bottom rectangle represents the number of NLRs identified only by the NB-ARC domain.c) Breakdown of total classified NLRs in the genome annotation with the addition of genes recovered from BRAKER and their contribution to the NLR classes.From left to right, the bars represent the TNL, CNL, and RNL classes of NLRs.For each class (bar), the top rectangle (gray) represents the number of complete NLRs; the next rectangle down (orange) represents the partial NLRs; and the bottom rectangle (blue, missing from the RNL bar) represents the NLRs recovered from BRAKER.

Table 1 .
Quantitative statistics of the initial sequencing data and intermediate processed reads.

Table 2 .
Quantitative statistics of the intermediate and final assembly steps.

Table 3 .
Statistics on the structural annotation of the whitebark pine reference genome assembly.

Table 4 .
Quantitative statistics of recently published large conifer genomes.