-
PDF
- Split View
-
Views
-
Cite
Cite
Theodora Lo, Lauren Coombe, Kristina K Gagalova, Alex Marr, René L Warren, Heather Kirk, Pawan Pandoh, Yongjun Zhao, Richard A Moore, Andrew J Mungall, Carol Ritland, Nathalie Pavy, Steven J M Jones, Joerg Bohlmann, Jean Bousquet, Inanç Birol, Ashley Thomson, Assembly and annotation of the black spruce genome provide insights on spruce phylogeny and evolution of stress response, G3 Genes|Genomes|Genetics, Volume 14, Issue 1, January 2024, jkad247, https://doi.org/10.1093/g3journal/jkad247
- Share Icon Share
Abstract
Black spruce (Picea mariana [Mill.] B.S.P.) is a dominant conifer species in the North American boreal forest that plays important ecological and economic roles. Here, we present the first genome assembly of P. mariana with a reconstructed genome size of 18.3 Gbp and NG50 scaffold length of 36.0 kbp. A total of 66,332 protein-coding sequences were predicted in silico and annotated based on sequence homology. We analyzed the evolutionary relationships between P. mariana and 5 other spruces for which complete nuclear and organelle genome sequences were available. The phylogenetic tree estimated from mitochondrial genome sequences agrees with biogeography; specifically, P. mariana was strongly supported as a sister lineage to P. glauca and 3 other taxa found in western North America, followed by the European Picea abies. We obtained mixed topologies with weaker statistical support in phylogenetic trees estimated from nuclear and chloroplast genome sequences, indicative of ancient reticulate evolution affecting these 2 genomes. Clustering of protein-coding sequences from the 6 Picea taxa and 2 Pinus species resulted in 34,776 orthogroups, 560 of which appeared to be specific to P. mariana. Analysis of these specific orthogroups and dN/dS analysis of positive selection signatures for 497 single-copy orthogroups identified gene functions mostly related to plant development and stress response. The P. mariana genome assembly and annotation provides a valuable resource for forest genetics research and applications in this broadly distributed species, especially in relation to climate adaptation.
Introduction
Globally and particularly within the boreal biome, forest health is declining due to the vulnerability of trees to increasing biotic and abiotic stresses associated with climate change (Allen et al. 2010, 2015; Gauthier et al. 2015). While some tree species or populations may have the capacity for rapid adaptation or migration, others will suffer from maladaptation under changing environmental conditions (Aitken et al. 2008; Benomar et al. 2022). One species of interest is black spruce (Picea mariana [B.S.P.] Mill.), as it is both an ecologically and economically important conifer, being one of the most abundant, widely planted tree species in Canada's boreal forests and highly valued for its wood products (Mullin et al. 2011). As a transcontinental species, geographic genetic variation related to phylogeographic history (Jaramillo-Correa et al. 2004; Gérardi et al. 2010; Prunier et al. 2012) and adaptive variation of clinal nature in relation to climate have been reported (Beaulieu et al. 2004; Thomson et al. 2009). Given the pace of climate change at boreal latitudes, significant maladaptation is expected to occur across much of the species’ range (Thomson et al. 2009).
Early studies have played a critical role in advancing our understanding of the genomic basis of adaptive variation in forest trees (Neale and Kremer 2011; Plomion et al. 2016). The sequencing of the first conifer giga-genomes, including Picea glauca (Birol et al. 2013; Warren, Keeling, et al. 2015), Picea abies (Nystedt et al. 2013), Pinus taeda (Neale et al. 2014), Pinus lambertiana (Gonzalez-Ibeas et al. 2016), Picea engelmannii (Gagalova et al. 2022), Picea sitchensis (Gagalova et al. 2022), and more recently, the chromosome-scale assemblies of Sequoiadendron giganteum (Scott et al. 2020), Pinus tabuliformis (Niu et al. 2022), and Sequoia sempervirens (Neale et al. 2022), provided key insights into the different genome structure and evolution of these gymnosperm species. Other significant genomic resources have been developed for spruces, although they are limited to transcriptome-related data for various species (Bousquet et al. 2021). For black spruce, saturated genetic maps, annotated gene, and SNP resources, and a gene-based genotyping chip have been made available (Pavy et al. 2008; Mann et al. 2013; Pavy et al. 2016; Van Ghelder et al. 2019), enabling association genetics and quantitative trait loci studies (Prunier et al. 2011, 2013), as well as the deployment of genomic selection (Lenz et al. 2017). The chloroplast genome sequence of black spruce has also been more recently determined (Lo et al. 2020).
Complete conifer genome sequences have been sparse due to the associated high cost of sequencing and atypical genome sizes which are amongst the largest of all plants—typically around 20 Gb (De La Torre et al. 2014). However, the rapidly decreasing cost of high-throughput sequencing has recently enabled genome sequencing and assembly for many tree species, with more than 30 genome assemblies now available, including the large genomes of 4 economically important North American spruces (Falk et al. 2018; Gagalova et al. 2022). The burgeoning availability of genomic resources has facilitated an increase in the number of genome-wide population genetic studies of conifers, providing knowledge on the genomic architecture of adaptive variation necessary for improved breeding and conservation under climate change (Namroud et al. 2008; Pelgas et al. 2011; Chen et al. 2012; Savolainen et al. 2013; Hornoy et al. 2015; Yeaman et al. 2016; Depardieu et al. 2021; MacLachlan et al. 2021). These genomic resources have also hastened the development and deployment of tree breeding strategies directly tackling stress response in relation to climate change (Isabel et al. 2020; Laverdière et al. 2022).
Given that much of the black spruce genome has remained undeciphered, its sequencing, assembly, and annotation will be a valuable contribution to the community. The new black spruce genomic resources will facilitate future studies into natural genetic variation and trait genomic architecture in relation to local adaptation, thereby enabling improved conservation and breeding strategies under climate change. Here, we present the first reference genome for black spruce, its phylogenetic implications, and analyze selection signatures in relation to adaptation to biotic and abiotic stress.
Materials and methods
DNA extraction and sequencing
An individual tree that provided convenient access to newly flushed bud tissue was selected from a long-term black spruce provenance test in Thunder Bay, Ontario. The selected individual (genotype 40-10-1) represents a provenance native to northwestern Ontario (50° 57′ 39.96′′N, 90° 27′ 20.16′′E; elevation, 741 m). Sampled tissue was immediately flash-frozen in liquid nitrogen and maintained at −80°C until the time of DNA extraction. High molecular weight (HMW) DNA was extracted from the newly flushed needle tissue by Bio S&T (http://www.biost.com/, Montreal, QC, Canada) using the cetyltrimethylammonium bromide method and HMW genomic DNA extraction protocol as detailed in the Chromium Genome Reagent Kits Version 2 User Guide (PN-120229). Assessment of DNA integrity by pulsed-field gel electrophoresis indicated DNA sizes were concentrated at the 20 kbp to 250 kbp range. A total of 60 μg of high-quality purified DNA was sent to Canada's Michael Smith Genome Sciences Centre (https://bcgsc.ca/, Vancouver, BC, Canada) to produce a single library using the 10 × Genomics Chromium system, as previously described (Lo et al. 2020). An additional 4 Illumina-compatible libraries, 2 with estimated fragment sizes of 400 bp and 2 with estimated fragment sizes of 800 bp, were constructed. The resulting 10 × Genomics and Illumina-compatible libraries were sequenced on an Illumina HiSeqX instrument yielding 5 lanes of paired-end 150 bp reads and 4 lanes of paired-end 250 bp reads, respectively (Supplementary Table 1 in Supplementary File 1).
Mitochondrial genome assembly
Given that mitochondrial genome assemblies are available for a number of spruce taxa (Jackman et al. 2015, 2020; Sullivan et al. 2020; Gagalova et al. 2022), the P. mariana mitochondrial genome was also assembled here to be used for downstream analyses. Adapter trimming was performed on all reads using Trimadap vr11 (Li 2014), then assembled with ABySS v2.1.0 (Jackman et al. 2017) at various k-mer sizes (k = 64, 72, 80, 88, 96, 104, 116, 122, 128) and k-mer multiplicity thresholds (kc = 3, 4). The assembly with the highest NG50 (k = 116; kc = 3) was determined by QUAST v5.0.2 (Gurevich et al. 2013) and mitochondrial DNA sequences were extracted based on BWA-MEM alignments of the scaffolds to a reference interior spruce mitochondrial genome (GenBank accessions MK697696–MK697708) (Jackman et al. 2015). These sequences were then error-corrected with Tigmint v1.1.2 (Jackman et al. 2018), then passed to ARCS v1.0.6 (Yeo et al. 2018) and LINKS v1.8.7 (Warren, Yang, et al. 2015) (m = 4-20000; k = 20; l = 10; a = 0.1) for scaffolding.
Genome assembly
Two rounds of read-merging were performed on the adapter-trimmed reads. The first round consisted of cascading Konnector runs (Vandervalk et al. 2015), where the 10 × Genomics and Illumina HiSeq reads were merged using k values ranging from 115–75 and 235–75, respectively, both with a step size of −10. Reads that were unable to be merged in the first round were subjected to a second round of merging with abyss-mergepairs (Jackman et al. 2017). The longer pseudo-reads and any remaining unmerged reads were then assembled using ABySS v2.2.5 (k = 96, 112, 128, 144, 160; kc = 3, 4). The best assemblies, as assessed by abyss-fac (Jackman et al. 2017), were passed to ntJoin v1.0.3 (Coombe et al. 2020) to perform iterative assembly-guided scaffolding runs, each with the following parameters: no_cut = True; k = 32; w = 250; reference_weights =’2’. Following scaffolding, a round of misassembly correction was performed on the resulting assembly using Tigmint v1.1.2 with span = 2. In addition to the post-ntJoin assembly, 10 × Genomics Chromium reads were also passed as input. This was followed by another round of scaffolding using ARCS v1.1.1 and LINKS v1.8.6 (c = 3; l = 3; a = 0.9; z = 3000; s = 90). Introduced gaps were filled with Sealer v2.2.3 (Paulino et al. 2015) (L = 150; P = 10 l; k = 75,85,95,105,115), yielding the final genome assembly.
Identification and annotation of protein-coding sequences
Prior to annotating the genome assembly, repeat regions were masked using RepeatMasker v4.1.1 (Chen 2004). A custom P. mariana repeat library was constructed using both LTR_retriever v2.9.0 (Ou and Jiang 2018) and RepeatModeler v2.0.1 (Flynn et al. 2020), then supplied as input to RepeatMasker to supplement the RepBase v22.08 repeat library (Bao et al. 2015). Subsequently, gene models were identified in the repeat-masked genome assembly using BRAKER v2.1.6 (Brůna et al. 2021) (--softmasking --etpmode), providing both protein sequences and RNA-seq alignments as evidence as well as an AUGUSTUS model that was pretrained with BUSCO v4.1.4 (Stanke et al. 2006, Simão et al. 2015) (--long). Specifically, a total of 3,463,432 unique proteins from OrthoDB v10 Viridiplantae database (Kriventseva et al. 2019), UniProtKB/Swiss-Prot plant entries (The Uniprot Consortium 2019), P. glauca manual annotations (Warren, Keeling, et al. 2015), and high-quality proteins from annotations found in at least 3 of 4 other spruce taxa (Gagalova et al. 2022) selected using a reciprocal best hit (RBH) approach were used (Supplementary Table 2 in Supplementary File 1). Reciprocal BLAST searches were performed between the North American spruce with the most annotations, P. engelmannii, and each of the other 3 North American spruces (Gagalova et al. 2022). Proteins found in at least 2 of the 3 reciprocal BLAST search results were considered common amongst the North American spruces and included as evidence. To compile the RNA evidence, publicly available P. mariana Illumina HiSeq 2 × 100 bp reads derived from seed tissue were obtained (SRA accessions: SRR9595774 and SRR9595777) (Shao et al. 2019) and subjected to adapter trimming as well as quality filtering with fastp v0.23.1 (Chen et al. 2018). By default, reads with at least 40% of bases having a Phred quality score < 15 were filtered out. Contaminant filtering was performed on the remaining reads with BioBloom tools v2.3.3 (Chu et al. 2014), where Bloom filters were created from Aphid, Archaea, Bacteria, Fungi, Protozoa, and Viral reference sequences obtained from RefSeq (O'Leary et al. 2016) (Supplementary Table 3 in Supplementary File 1). Reads that did not have hits to any of the Bloom filters, and thus not identified as contaminants, were aligned to the repeat-masked genome assembly using HISAT2 v2.2.0 (Kim et al. 2015) with --max-intronlen 1000000.
Following BRAKER annotation, complete protein-coding sequences, defined as those with start and stop codons, containing introns ≥10 bp as identified by GAG v2.0.1 (Geib et al. 2018), were functionally annotated using EnTAP v0.10.8 (Hart at al. 2020)—an annotation pipeline that assigns functions based on similarity search hits to user-selected databases as well as EggNOG and InterProScan hits. EnTAP was run in protein mode (--runP) with P. mariana provided as the target species and Aphid, Bacteria, and Fungi as contaminant taxa. Similarity searches were performed against the Gymno 1.0 PLAZA database (Proost et al. 2009), OrthoDB v10 Viridiplantae database, UniprotKB/Swiss-Prot plant entries (The Uniprot Consortium 2019), and Uniref90 database (Suzek et al. 2015). In addition to these, the UniProtKB/TrEMBL database (The Uniprot Consortium 2019) was also included for contamination-screening purposes.
Annotations with similarity search and/or ontology hits were assessed for the presence of Pfam domains found in gag and pol genes. Corresponding annotations were removed on the basis that the internal region of all long-terminal repeat (LTR) retrotransposons consists of these 2 genes and thus, were likely missed by RepeatMasker. Furthermore, as fragmentation of a gene over multiple scaffolds could be common in draft assemblies, fragmented genes were identified and flagged if the following criteria were met: (1) TPM < 1, as determined by SALMON v1.3.0 (Patro et al. 2017) and (2) in at least 2 sets of read pairs, the paired-end reads mapped to genes on different scaffolds (Supplementary Table 4 in Supplementary File 1). All 883 flagged fragments were grouped into genes based on shared read pair mappings and the longest fragment of each group was then selected as the representative gene fragment for downstream comparative genomics analyses. The quality of the final assembly and annotation were assessed using abyss-fac and BUSCO v5.2.2 with the OrthoDB v10 embryophyta dataset in protein mode.
Phylogenetic analyses
Three phylogenetic trees were constructed using P. mariana genome sequences and publicly available nuclear, chloroplast, and mitochondrial genome sequences of 5 other spruce taxa [Norway spruce (P. abies) v1, white spruce (P. glauca) v2, Engelmann spruce (P. engelmannii) v1, Sitka spruce (P. sitchensis) v1, and the tri-hybrid interior spruce (P. glauca x engelmannii x sitchensis) v5], and 2 pines [loblolly pine (P. taeda) v2.01 and sugar pine (P. lambertiana) v1.5] (Coombe et al. 2016; Asaf et al. 2018; Lin et al. 2019a, 2019b) (Supplementary Table 5 in Supplementary File 1), using MashTree v1.2.0 (Katz et al. 2019) in bootstrap mode: mashtree_bootstrap.pl --reps 100 --numcpus 24 *.fa -- --genomeSize [nuclear: 20000000000, chloroplast: 120000, mitochondrion: 5000000] > *.dnd. The resulting trees were visualized with MEGA11 (Tamura et al. 2021) and rooted using the pine species as the outgroup.
Comparative genomics analyses
The use of different workflows can yield different annotations and thus, varying results in downstream analyses (Venkatraman et al. 2021). To account for this, the annotation files for P. mariana and the aforementioned taxa (P. abies v1, P. glauca v2, P. engelmannii v1, P. sitchensis v1, interior spruce v5, P. taeda v2.01, and P. lambertiana v1.5) were subjected to the same filtering steps prior to comparative genomics analyses—only complete genes with lengths ≥ 1 kbp and introns ≥ 10 bp were kept. The longest transcript per gene was supplied to OrthoFinder v2.5.4 (Emms and Kelly 2019) and run with default settings to identify phylogenetic hierarchical orthogroups, referred to as orthogroups hereinafter.
From the OrthoFinder results, one can obtain species-specific orthogroups on which further analysis can be performed. Of particular interest were those specific to P. mariana. RBH analysis was performed to verify that the orthogroups identified as unique to P. mariana did not have any orthologs in other species. MMseqs2 v14.7e284 (Steinegger and Söding 2017) was used to perform reciprocal BLAST searches between P. mariana specific transcripts and high-quality transcripts annotated in each of the 4 North American Picea taxa (Gagalova et al. 2022). Those with RBHs in at least one of the four taxa were removed from analyses pertaining to P. mariana specific orthogroups.
Identifying positively selected protein-coding sequences in P. mariana
We used the ratio of nonsynonymous to synonymous substitutions (ω = dN/dS) to detect signatures of positive selection (Kimura 1977; Kryazhimskiy and Plotkin 2008). Given that synonymous mutations tend to be neutral or nearly neutral, when dN > dS, natural selection is favoring changes in protein-coding sequences more than neutral expectations and thus, the fixed mutations likely provide a fitness advantage (Kimura 1977). For instance, such changes would be beneficial in stressful environmental conditions, such as those imposed by climate fluctuations (Hoffmann and Hercus 2000).
Detecting signals of positive selection based on pairwise estimations of ω is reputed to be a stringent procedure given that positive selection is only detected if the ω averaged over all sites is >1 (Yang and dos Reis 2011). An alternative and more powerful approach is the use of branch-site tests as it allows variation in ω among branches as well as sites and thus, permits detection of positive selection acting on a few amino acids, which is often the case (Zhang 2005). Given a multiple sequence alignment of genes and a species tree with the branches divided into foreground and background (where the foreground branches are those of interest), the parameters of null and alternative models are estimated. The difference between the 2 models pertains to the predefined foreground branches on which positive selection is allowed. A likelihood ratio test (LRT) is then performed to determine if the alternative model fits significantly better than the null model.
Amongst the 6 spruce and 2 pine taxa, a total of 497 single-copy orthogroups were determined by OrthoFinder and used for this analysis. Primary sequence errors can contribute to misalignments and inaccurate branch length estimates, potentially resulting in overestimated ratios of nonsynonymous to synonymous substitution rates (ω = dN/dS) and consequently, false detection of positively selected genes (Schneider et al. 2009; Di Franco et al. 2019). To minimize the effects of such errors, nonhomologous sequences and repeats were trimmed and/or masked in each single-copy orthogroup by PREQUAL v1.02 (Whelan et al. 2018)—a prealignment filtering tool. Due to the presence of annotation errors in 4 orthogroups, these could not be processed by PREQUAL and were excluded from the analysis. Multiple sequence alignments were generated for the remaining 493 orthogroups using the codon-aware aligner MACSE v2.06 (Ranwez et al. 2018), then provided as input with the previously generated MashTree species topology to CODEML, a program implemented in PAML v4.9j (Yang 2007). To test for positive selection in protein-coding sequences along the P. mariana lineage, branch-site models A1 and A were used with the branch leading to P. mariana set as the foreground. LRTs were conducted to compare the null and alternative models (at α = 0.05) by first calculating the LRT test statistic, then obtaining P-values from the cumulative chi-square distribution function with degrees of freedom, df = 1.
Results and discussion
Genome sequencing, assembly, and annotation
Sequencing one library constructed using the 10 × Genomics Chromium system, 2 Illumina-compatible libraries with an estimated fragment size of 400 bp, and another 2 Illumina-compatible libraries with an estimated fragment size of 800 bp yielded 2.2 billion 2 × 150 bp reads, 252 million 2 × 250 bp reads, and 341 million 2 × 250 bp reads, respectively, totaling to a fold coverage of approximately 46× (Supplementary Table 1 in Supplementary File 1). All reads were used in the assembly and initial scaffolding stages.
Various ABySS k and kc parameter combinations were explored, optimizing for the NG50 length metric and the resulting assemblies were merged with subsequent, iterative ntJoin runs, leveraging the contiguity of each individual assembly. Contiguity evaluation of each assembly indicated that the ABySS assembly with k = 176 and kc = 3 had one of the highest NG50 values (7.2 kbp) and thus, was used as the target for ntJoin runs. Many of the remaining kc = 3 assemblies had relatively high NG50s in comparison to those assembled with kc = 4 and thus, were provided as references. The long-range information in linked reads was used for misassembly correction and further scaffolding, ultimately yielding a final assembly consisting of 1,560,767 contigs with a NG50 length of 36.0 kbp and reconstruction of 18.3 Gbp (Table 1).
Statistic . | Value . |
---|---|
Assembly size (Gbp) | 18.3 |
Number of scaffolds (> 500 bp) | 1,549,050 |
Scaffold N50 length (bp) | 45,468 |
Scaffold NG50 length (bp) | 35,958 |
Repeat content (%) | 79.1 |
LTR elements | 49.3 |
Number of protein-coding sequences | 66,332 |
Number of transcripts | 68,738 |
Mean protein-coding sequence length (bp) | 4,165 |
Mean CDS length (bp) | 827 |
Mean exon length (bp) | 256 |
Mean intron length (bp) | 1,612 |
Statistic . | Value . |
---|---|
Assembly size (Gbp) | 18.3 |
Number of scaffolds (> 500 bp) | 1,549,050 |
Scaffold N50 length (bp) | 45,468 |
Scaffold NG50 length (bp) | 35,958 |
Repeat content (%) | 79.1 |
LTR elements | 49.3 |
Number of protein-coding sequences | 66,332 |
Number of transcripts | 68,738 |
Mean protein-coding sequence length (bp) | 4,165 |
Mean CDS length (bp) | 827 |
Mean exon length (bp) | 256 |
Mean intron length (bp) | 1,612 |
Statistic . | Value . |
---|---|
Assembly size (Gbp) | 18.3 |
Number of scaffolds (> 500 bp) | 1,549,050 |
Scaffold N50 length (bp) | 45,468 |
Scaffold NG50 length (bp) | 35,958 |
Repeat content (%) | 79.1 |
LTR elements | 49.3 |
Number of protein-coding sequences | 66,332 |
Number of transcripts | 68,738 |
Mean protein-coding sequence length (bp) | 4,165 |
Mean CDS length (bp) | 827 |
Mean exon length (bp) | 256 |
Mean intron length (bp) | 1,612 |
Statistic . | Value . |
---|---|
Assembly size (Gbp) | 18.3 |
Number of scaffolds (> 500 bp) | 1,549,050 |
Scaffold N50 length (bp) | 45,468 |
Scaffold NG50 length (bp) | 35,958 |
Repeat content (%) | 79.1 |
LTR elements | 49.3 |
Number of protein-coding sequences | 66,332 |
Number of transcripts | 68,738 |
Mean protein-coding sequence length (bp) | 4,165 |
Mean CDS length (bp) | 827 |
Mean exon length (bp) | 256 |
Mean intron length (bp) | 1,612 |
Conifer genomes have a high proportion of repetitive elements, particularly transposable elements (TEs), with the majority of these being LTRs (De La Torre et al. 2014). As the presence of these repetitive sequences can lead to false evidence for annotations, it is standard practice to repeat-mask the genome prior to annotation (Yandell and Ence 2012). By supplying a P. mariana custom repeat library to RepeatMasker, 79.1% of the genome was identified as repetitive sequences whereas 49.3% consisted of LTR elements (Table 2). This is consistent with findings in other spruce genomes, with reported proportions greater than 70% of highly repetitive DNA (Gagalova et al. 2022). The repeat-masked genome was then passed to BRAKER followed by EnTAP for identification of gene models and functional annotation, respectively. A total of 66,332 protein-coding sequences and 68,738 transcripts were annotated. The median number of introns per isoform was 5, with longer genes often yielding transcripts that contain more introns (Fig. 1a). Long introns are characteristic of conifer genomes (De La Torre et al. 2014; Stival Sena et al. 2014). The longest intron reported had a length of 177 kbp, and 600 introns had lengths greater than 50 kbp (Fig. 1b). Evaluation of annotation quality with BUSCO indicated a total of 416 (25.7%) complete BUSCOs in the annotation.

Intron sequences features for all transcript isoforms of each gene annotated in P. mariana. a) Number of introns found in each isoform at various gene length thresholds (>100 bp, 10,000 bp, 25,000 bp, 100,000 bp, and 250,000 bp). Each plot is annotated with the number of isoforms that meet the threshold. b) Distribution of intron lengths extracted from all isoforms.
Repeat content in P. mariana. Proportions of each repeat type were determined using RepeatMasker and LTR_retriever.
Repeat type . | Proportion (%) . |
---|---|
LTR elements | 49.3 |
Gypsy | 30.1 |
Copia | 12.6 |
Unknown | 6.6 |
LINEs | 2.0 |
DNA transposons | 1.0 |
Simple repeats | 0.3 |
Other repeats | 0.3 |
Unclassified | 26.2 |
Total repeat content | 79.1 |
Repeat type . | Proportion (%) . |
---|---|
LTR elements | 49.3 |
Gypsy | 30.1 |
Copia | 12.6 |
Unknown | 6.6 |
LINEs | 2.0 |
DNA transposons | 1.0 |
Simple repeats | 0.3 |
Other repeats | 0.3 |
Unclassified | 26.2 |
Total repeat content | 79.1 |
Repeat content in P. mariana. Proportions of each repeat type were determined using RepeatMasker and LTR_retriever.
Repeat type . | Proportion (%) . |
---|---|
LTR elements | 49.3 |
Gypsy | 30.1 |
Copia | 12.6 |
Unknown | 6.6 |
LINEs | 2.0 |
DNA transposons | 1.0 |
Simple repeats | 0.3 |
Other repeats | 0.3 |
Unclassified | 26.2 |
Total repeat content | 79.1 |
Repeat type . | Proportion (%) . |
---|---|
LTR elements | 49.3 |
Gypsy | 30.1 |
Copia | 12.6 |
Unknown | 6.6 |
LINEs | 2.0 |
DNA transposons | 1.0 |
Simple repeats | 0.3 |
Other repeats | 0.3 |
Unclassified | 26.2 |
Total repeat content | 79.1 |
Compared to the published nuclear genome assemblies and annotations of 5 other spruce and 2 pine species (Supplementary Table 5 in Supplementary File 1), the P. mariana assembly had a lower contiguity and number of complete BUSCOs than most of them (Supplementary Table 6 in Supplementary File 1). However, this was expected due to the different sequencing technologies and coverage of data that was used for the various assemblies. Whereas the more contiguous genomes were assembled using reads with 80–110 fold coverage and, in some cases, benefitted from the use of long reads (Gonzalez-Ibeas et al. 2016; Zimin et al. 2017; Gagalova et al. 2022), P. mariana was assembled with short and linked reads sequenced at 46 × fold coverage. Despite this limited sequencing coverage, the reconstruction of the P. mariana genome is consistent with the genome sizes of other spruce species which were reported to be approximately 20 Gbp (Gagalova et al. 2022). The numbers of annotated protein-coding sequences and transcripts must be interpreted with caution when compared with the number of genes identified in previous spruce genome sequence studies given the different annotation pipelines and downstream filtering methods used to obtain the final annotations. For instance, BRAKER was used for predicting gene models in P. mariana, whereas a combination of AUGUSTUS and EuGene were used for P. abies and the MAKER-P pipeline was used for the other North American spruce species (Nystedt et al. 2013; Warren, Keeling, et al. 2015; Gagalova et al. 2022). Furthermore, the presence of frequent long introns interspacing exons in spruce genes (Nystedt et al. 2013; Stival Sena et al. 2014) coupled with the lower NG50 in the present study compared to Gagalova et al. (2022), can lead to the annotation of partial protein-coding sequences. Nonetheless, the higher number of annotated protein-coding sequences, despite applying filters such as completeness (presence of start and stop codon) and minimum intron length of 10 bp (Supplementary Table 4 in Supplementary File 1), was matched by a higher number of complete BUSCOs (Fig. 2, Supplementary Table 6 in Supplementary File 1). This result is not unexpected given the use of BRAKER, a more recently developed annotation pipeline that has been shown to yield annotations with higher BUSCO completeness scores, sensitivity, and precision compared to MAKER (Vuruputoor et al. 2023).

Assessment of P. mariana annotation quality (in bold face) relative to other Picea and Pinus taxa included in downstream analyses. The sequencing read type used in each genome assembly is indicated in parentheses on the y axis, where “hybrid” refers to the use of both short and long reads. Annotation quality was determined using BUSCO v5 in protein mode.
Phylogenetic analyses
To gain insights on the evolutionary relationships of P. mariana in relation to other spruce species, phylogenetic analysis was performed with 6 spruce and 2 pine species for which complete genome sequences were available (Supplementary Table 5 in Supplementary File 1), using pines as outgroups for these analyses. Phylogenetic trees were also assembled using chloroplast and mitochondrial genome sequences, given their uniparental mode of inheritance in Pinaceae (paternal for the chloroplast genome and maternal for the mitochondrial genome) and different tree topologies as observed in previous studies based on specific chloroplast and mitochondrial regions (Bouillé et al. 2011; Ran et al. 2015; Gagalova et al. 2022). The use of chloroplast, mitochondrial, and biparentally inherited nuclear genomes should thus provide a more complete picture of evolutionary relationships among these 6 spruce species. In the nuclear phylogeny, P. mariana together with the European P. abies, are sister groups to the other North American spruces included in this analysis, namely, P. glauca, P. engelmannii, P. sitchensis and the hybrid interior spruce (Fig. 3a). Whereas P. glauca, P. engelmannii, and P. sitchensis frequently hybridize with each other, giving rise to interior spruce (Gagalova et al. 2022), attempted crosses between these taxa and P. mariana have either failed or had low success rate, indicating higher evolutionary divergence of P. mariana relative to the others (Ran et al. 2006). However, the position of P. mariana relative to P. abies remained ambiguous given the weak support of P. abies as a sister group to P. glauca, P. engelmannii, P. sitchensis, and interior spruce.

MashTree phylogenies from the a) nuclear, b) chloroplast, and c) mitochondrial genomes of P. mariana (in bold face) and related species. MashTree estimates distances based on k-mer sketches of the genomes, then constructs phylogenetic trees using the neighbor-joining method. For all 3 phylogenies, the trees were rooted using pine species as outgroups. Branch lengths are proportional to the number of substitutions per nucleotide position. Values at the nodes indicate bootstrap values.
A similar phylogenetic tree topology was obtained with the chloroplast and mitochondrial genomes, but the positions of P. sitchensis and P. abies conflicted among phylogenies (Fig. 3). In the chloroplast phylogeny (Fig. 3b), the position of European P. abies as a sister group to North American P. glauca, P. engelmannii, P. sitchensis, and interior spruce was weakly supported. However, in the mitochondrial phylogeny (Fig. 3c), there was strong support for P. mariana as a sister group to these other North American species. Furthermore, in the mitochondrial phylogeny and contrary to the chloroplast phylogeny, there was strong support for P. sitchensis as a sister group to P. glauca, P. engelmannii, and interior spruce, with the latter confirmed as a natural tri-hybrid of the former 3 species (Gagalova et al. 2022). This is indicative of incomplete reproductive isolation and reduced interspecific genetic divergence among them. The topology of the chloroplast phylogeny placed P. sitchensis as an outgroup to all other spruce taxa, which conflicts with the 2 other phylogenies and biogeography. This odd positioning had previously been observed and proposed to result from ancient reticulate evolution affecting the P. sitchensis chloroplast genome (Bouillé et al. 2011; Sullivan et al. 2017; Gagalova et al. 2022).
Though weakly supported, the odd positioning of the European P. abies as a sister group to P. glauca and P. engelmannii on the chloroplast phylogeny, and to P. glauca, P. engelmannii, and P. sitchensis on the nuclear phylogenetic tree, might also be indicative of reticulation between P. abies and P. glauca through ancient gene flow (Chen et al. 2010). Such contact could date back to an interglacial period approximately 400,000 years ago when spruce species were dominant in Greenland, in particular P. abies (de Vernal and Hillaire-Marcel 2008). Such a tree topology was not observed on the mitochondrial phylogeny where a strong geographic structure placing P. mariana as the sister group to all other North American spruces was observed and expected from ancient geographical speciation (Bouillé et al. 2011). Indeed, spruce mitochondrial genomes are dispersed by spruce seeds only, which disseminate across much smaller distances than pollen and thus, chloroplast genomes. As such, the mitochondrial tree may be more indicative of vertical descent associated with phylogeographic speciation, while the chloroplast tree would integrate horizontal transfers by pollen though more or less ancient reticulate evolution between already genetically distinct lineages (Gérardi et al. 2010; Bouillé et al. 2011; Sullivan et al. 2017; Gagalova et al. 2022). As for the nuclear phylogeny, given the biparental inheritance of the nuclear genome, it well reflects a blend of both types of evolution.
Comparative genomics
In a recent study, Gagalova et al. (2022) showed that the allopatric and ecologically divergent P. glauca and P. sitchensis had mostly distinct sets of rapidly evolving genes under positive selection, largely related to stress and stimuli response. Although the natural ranges of P. mariana and P. glauca are mostly sympatric, the 2 species have also adapted to distinct climatic and ecological niches. Black spruce is often found on cold, wet, nutrient-poor soils, whereas white spruce tends to inhabit sites with warmer, well-drained soils (Nicklen et al. 2021). As several studies have associated species-specific genes with environmental stress responses and unique traits (Li et al. 2009; Arendsee et al. 2014), further delineating and investigating protein-coding sequences found specific to P. mariana may provide insights on how it has adapted to its rather specific ecological niche among the spruce species analyzed here.
To compare the protein-coding sequence content among the 6 spruce and 2 pine species, an orthogroup analysis was performed using OrthoFinder. In the context of the present taxa sampling, this analysis resulted in the identification of 34,776 orthogroups in total, with 2,869 being shared amongst all species and 560 found specific to P. mariana (Supplementary Table 7 in Supplementary File 1).
Analysis of P. mariana specific orthogroups
Of the 1,900 protein-coding sequences assigned to P. mariana specific orthogroups, 109 sequences had RBHs with genes in at least one of the four North American spruces and thus, were removed from subsequent analyses. On further analysis of the remaining P. mariana specific protein-coding sequences, 792 were assigned functional categories based on orthologous hits in the EggNOG database. Majority of the predicted functions were unknown (Fig. 4a), with many protein-coding sequences containing a domain of unknown function (DUF4283); however, the EamA and F-box domains were also highly prevalent (Fig. 4b, Supplementary Fig. 1 in Supplementary File 1) and these gene families have been partially characterized in Viridiplantae. The EamA gene family is involved in transport (Västermark et al. 2011) and sequences belonging to the F-box superfamily are responsible for controlling diverse biological processes including growth, development, and abiotic stress tolerance (Abd-Hamid et al. 2020; Rao and Virupapuram 2021).

Distribution of a) EggNOG functional categories and b) best matching PFAM domains (n > 5). Functional category annotations were assigned based on hits to entries in the EggNOG database, while PFAM annotations were assigned via hits in EggNOG and/or InterPro databases as part of the EnTAP annotation pipeline. If a protein-coding sequence was assigned to multiple functional categories, each category was counted separately.
Apart from those classified as function unknown, there was a relatively large presence of sequences in the following categories: signal transduction mechanisms; post-translational modification, protein turnover, chaperones; transcription; carbohydrate transport and metabolism; secondary metabolites biosynthesis, transport, and catabolism (Fig. 4a). Among those classified under signal transduction mechanisms, Pkinase domains were particularly prevalent (Fig. 4b, Supplementary Fig. 1 in Supplementary File 1). NB-ARC and LRR 1 domains, both found in NLR proteins which are widely known to play a central role in resilience to biotic and abiotic stresses (Van Ghelder et al. 2019; Ence et al. 2022), were present in high numbers as well. Interestingly, a study aimed at developing a repertoire of conifer NLR genes identified more in P. mariana compared to P. glauca (Van Ghelder et al. 2019), thereby supporting our finding of numerous NB-ARC and LRR 1 domain-containing sequences unique to P. mariana. Additionally, Lectin LegB domains were often found in sequences associated with signal transduction mechanisms. Various studies have demonstrated the critical role that this domain plays in pathogen response (Singh and Zimmerli 2013; Lannoo and Van Damme 2014) and a few identifying links to abiotic stress response and developmental processes (Wan et al. 2008; Li et al. 2014; Zhao et al. 2019).
Domains found in sequences annotated with the post-translational modification, protein turnover, chaperones EggNOG functional category included PG binding 1 and AAA (Fig. 4b, Supplementary Fig. 1 in Supplementary File 1). Many sequences containing the PG binding 1 domain were annotated with an InterPro domain in matrix metalloproteinases (MMPs)—Pept M10 metallopeptidase. Few plant MMPs have been characterized in detail, but a recent study assessing genetic variation among drought resilient Norway spruce trees identified a SNP in a MMP family gene that had significant association with wood density (R2 = 0.25–0.26) (Trujillo-Moya et al. 2018).
Myb Dna-binding domain was the most frequent domain in the transcription EggNOG functional category (Fig. 4b, Supplementary Fig. 1 in Supplementary File 1), accounting for 11 of the 58 sequences assigned to that category. Previously, several R2R3-MYB genes had been identified and characterized in P. mariana and P. glauca (Xue et al. 2003; Bedon et al. 2007). However, given the large size of the R2R3-MYB family, with over 120 genes reported in angiosperms (Riechmann et al. 2000), it is likely that those previously identified make up only a fraction of the R3R3-MYB clade in spruce, let alone those that have yet to be identified in other MYB clades. Following Myb Dna-binding, the Med26 domain, which plays a possible role in transcription elongation (Mathur et al. 2011), and RNA pol Rpb2 2, the domain characteristic of genes encoding the second-largest subunit of DNA-dependent RNA polymerase II, also appeared frequently in the transcription EggNOG functional category.
Protein-coding sequences related to carbohydrate transport and metabolism were often categorized under energy production and conversion. Within these 2 EggNOG functional categories, UDPGT, Alde H, RuBisCO large, and MFS 1 domains were the most represented (Fig. 4b, Supplementary Fig. 1 in Supplementary File 1). The p450 domain was most prevalent in the secondary metabolite biosynthesis, transport, and catabolism EggNOG functional category. These domains are characteristic of cytochrome P450 oxygenases (CYPs) known to play important roles in oleoresin defenses of conifers and in interactions of conifers with insect pests and pathogens (Celedon and Bohlmann 2019; Chiu and Bohlmann 2022). Expansions within CYP subfamilies have been detected (Warren, Keeling, et al. 2015); therefore, the p450 domain-containing P. mariana specific protein-coding sequences may be members of the CYP gene family that have not been annotated in the species included in this analysis.
In all, 999 P. mariana specific protein-coding sequences were not assigned functional categories as there were no orthologous hits in the EggNOG database, but 122 of these sequences were annotated with PFAM domains based on InterPro hits. The most prevalent PFAM domain amongst these was the TCP domain, which has been found in transcription factors regulating plant growth and development (Cubas et al. 1999).
There are various potential factors that could contribute to the presence of these P. mariana specific protein-coding sequences and additional studies are needed to better understand their evolution. Black spruce exhibits unique ecological traits such as being adapted to the large variety of soil types and moisture conditions found in the temperate and boreal forests (Nicklen et al. 2021), as well as being able to maintain itself vegetatively for centuries in the inhospitable growing conditions of the northern open boreal forest and subarctic tundra through layering (Laberge et al. 2000). Thus, it would be interesting to investigate whether these coding sequences are encoded by novel genes related to these traits. It is also possible that some sequences appear species-specific due to sequence divergence leading to failed inference of orthologs and/or unannotated sequences in the other spruce and pine species. For instance, within the MYB gene family, the amino acid sequences outside of the MYB DNA-binding domains are highly divergent (Bedon et al. 2010), which may be challenging for multiple sequence alignment tools and thus, complicate the process of gene tree and orthogroup inference. Furthermore, several highly abundant PFAM domains in P. mariana specific protein-coding sequences were also found to be abundant in unannotated P. lambertiana transcripts, namely DUF4283, PKinase, NB-ARC, WD40, and LRR 1 (Gonzalez-Ibeas et al. 2016).
Protein-coding sequences under positive selection
To identify positively selected protein-coding sequences in P. mariana, branch-site tests were conducted on the 497 single-copy orthogroups identified between all 8 conifer taxa considered in this study. Of these, 45 (9.1%) were identified as positively selected (Supplementary Table 8), 38 of which could be annotated with one or more functional categories based on matches with known proteins in the Araport11 database (Cheng et al. 2017) and/or literature search highlighting a diverse set of molecular functions (Supplementary Table 9). The most represented functional category was development (12 sequences), followed by response to stress (8 sequences) and growth (7 sequences) (Fig. 5). Among them, several protein-coding sequences appear essential either for plant survival or plant adaptation to environmental conditions as illustrated by the following examples.

Functional classes of the 45 positively selected protein-coding sequences. In total, 38 sequences had a match with an Arabidopsis protein from the Araport11 database (blastp evalue < e−20) and could be associated to a functional class based on homology searches and/or literature searches. If a protein-coding sequence was assigned to multiple functional classes, each class was counted separately. “Not assigned” refers to sequences without hits in the Araport11 database, whereas “uncharacterized” refers to those that had matches but the functions of those Arabidopsis proteins have not been fully characterized yet.
Twelve protein-coding sequences were found to be involved in development, 7 of which were only assigned to that category. These include 2 sequences whose homologs regulate flowering time in Arabidopsis: the developmental protein FRIGIDA (Michaels et al. 2004), also involved in temperature adaptation (Tabas-Madrid et al. 2018), and the serine/threonine-protein kinase WNK1 that regulates flowering time by modulating the photoperiod pathway (Wang et al. 2008). Furthermore, there was an aspartate/glutamate/uridylate kinase family protein that has been identified to be a trihelix/aa-kinase chimera associated with leaf development (Kuromori et al. 2006; Kaplan-Levy et al. 2012). The other 4 sequences included an ARID domain-containing protein which is a seed-specific transcription factor in Arabidopsis (Zheng et al. 2014), a NAC transcription factor involved in xylem formation (Endo et al. 2015), ARF19 which is responsible for regulating various auxin-mediated development processes including lateral root formation (Okushima et al. 2005, 2007) and TMKL1, whose function has yet to be fully characterized but has been suggested to mediate vascular tissue development (Wu et al. 2016).
The remaining 5 of 12 sequences involved in development were also associated with other functional classes. One protein-coding sequence plays a role in RNA processing as well as development—RRD1 participates in mitochondrial mRNA deadenylation, which is fundamental in controlling early lateral root organogenesis (Otsuka et al. 2021). There were also 2 protein-coding sequences involved in both growth and development. One is an ankyrin repeat-containing protein essential for chloroplast biogenesis (Shen et al. 2010), while the other is a member of the IQ67-domain (IQD) protein family known to regulate plant growth, lateral organ polarity, and basal defense response against environmental cues (Bürstenbinder et al. 2017; Barda and Levy 2022). The last 2 sequences are involved in stress response and development: a development and cell death (DCD) domain-containing protein (Tenhaken et al. 2005) and AAC2, which encodes a mitochondrial ADP/ATP carrier that has been suggested to play a role in the mechanisms of ABA-mediated stress response (Kharenko et al. 2011).
A total of 8 sequences encoded proteins related to stress response. Three are involved in plant immunity or plant immunity regulation: a member of the pleiotropic drug resistance family, the PBL27 serine–threonine kinase involved in a cascade that leads to chitin-induced immunity (Kawasaki et al. 2017) and a homolog of the AtWRKY3 transcription factor regulating the expression of the pathogen induced gene NPR1 (Yu et al. 2001). PLIP2 is a glycerolipid A1 lipase that provides tolerance to various abiotic stresses, including cold, by ABA-mediated synthesis of jasmonate and oxolipins (Wang et al. 2018). Four of the 8 stress-related protein-coding sequences were assigned to another functional category along with stress response, 2 of which have been mentioned previously—a DCD domain-containing protein and AAC2, both contributing to plant development as well. Besides those, a sequence encoding DGD1 was additionally annotated with metabolism as it is responsible for the synthesis of DGDG, a glycolipid critical for the stabilization of chloroplast membranes, thereby conferring thermotolerance (Kobayashi et al. 2014). APUM8 was also annotated with RNA-processing and has been implicated in salt stress tolerance (Huang et al. 2018).
Five protein-coding sequences had direct or indirect roles in growth, cell elongation, and plant height in Arabidopsis: a filament-like protein (Chen et al. 2016), an actin-binding FH2 (formin 2) family protein (Vidali et al. 2009), DRACULA2 (Gallemí et al. 2016), CNGC8 which has been shown to be essential for pollen tube tip growth (Tunc-Ozdemir et al. 2013), and KORRIGAN2 (Mølhøj et al. 2001) whose homolog is also associated to early growth in pine (Cabezas et al. 2015).
The functional categories of the high dN/dS protein-coding sequences found in the present study overlap well with those reported by Buschiazzo et al. (2012) and De La Torre et al. (2015), both of which used estimations of dN/dS to detect positive selection. Eight gene families implicating high dN/dS values were found in common between the present study and that of De La Torre et al. (2015), including auxin responsive protein, eukaryotic aspartyl protease, alkaline and neutral invertase, ABC transporter, cyclic nucleotide gated channel, FRIGIDA-like proteins, ankyrin repeat-containing protein, and the O-Glycosyl hydrolases family 17. However, the specific protein-coding sequences with high dN/dS values were different between studies, including the fact that here, we considered only single-copy genes in the dN/dS analysis. These studies should be viewed as complementary, 2 being based on transcriptome data and our study on genomic data. Furthermore, there was a large overlap between the abundant functional categories found in this study and those presented in Gagalova et al. (2022), where the most rapidly evolving genes shared by P. sitchensis and P. glauca were identified by estimating the ratios of nonsynonymous to synonymous SNPs (SNP A/S) for each species.
Conclusions
We produced the first nuclear genome assembly of P. mariana with a NG50 length of 36.0 kbp and reconstruction of 18.3 Gbp. This is comparable to the estimated haploid genome size reported to be around 17.5 Gbp based on the measured C-value of 17.4 pg (Bai et al. 2012; Mann et al. 2013). This estimate indicates that the large size of the spruce nuclear genome is a feature shared by multiple lineages. Hence, given the average divergence time in the scale of 10 to 20 Myr between major spruce lineages leading to P. mariana, P. abies, and P. glauca (Bouillé and Bousquet 2005), our results support the finding that much of the spruce genome expansion would thus well predate this era and be shared by common ancestry (Nystedt et al. 2013).
Substantial genetic differentiation was detected between the black spruce nuclear and organellar genomes and those of other North American spruce species, reflecting the reproductive isolation of black spruce from the others and the more ancient divergence of its lineage. At the same time, signatures of at least 2 occurrences of ancient reticulate evolution of the spruce chloroplast and nuclear genomes were detected. These support evidence of incomplete lineage sorting (Bouillé and Bousquet 2005) and slow speciation characterized by delayed establishment of reproductive isolation influenced by the large effective population sizes and airborne dispersion of pollen over potentially long distances in these species (Bousquet et al. 2021).
Analysis of specific biological functions and positive selection revealed an abundance of genes implicated in development and stress responses, whether for biotic or abiotic factors. This is congruent with the often-harsh environments or climates characterizing the ecological niches occupied by spruce species, particularly black spruce as it is found mostly in the boreal forests of North America. This pattern is largely shared with observations from previous studies of molecular adaptation in spruces (e.g. Hornoy et al. 2015; Yeaman et al. 2016; Depardieu et al. 2021; Gagalova et al. 2022), although taxon-specific genes under positive selection appear much more frequent than shared ones, at least between spruces (Gagalova et al. 2022). This is likely related to the existence of large gene families and functional redundancy potentially leading to subfunctionalization or neofunctionalization (Warren, Keeling, et al. 2015; Stival Sena et al. 2018; Van Ghelder et al. 2019; De La Torre et al. 2020), which is likely beneficial in developing multiple molecular strategies for stimuli and stress response.
With black spruce predicted to decline due to increasing temperature across its natural range (Thomson et al. 2009), and the increasing frequency of extreme weather events such as late cold spells and drought episodes under mid-northern latitudes (Benomar et al. 2022; Laverdière et al. 2022), much remains to be understood about the specific mechanisms and molecular basis of genetic adaptation in this ecologically and economically important species. We expect that the assembly and annotation of the black spruce nuclear genome will aid the forest research community in gaining a better understanding of the different mechanisms and capacities for local adaptation to climate warming and increasing instability, thus contributing to the development of molecular-assisted breeding strategies and guidelines to orientate assisted migration.
Data availability
Raw sequencing reads are available on NCBI under BioProject PRJNA610902. This Whole Genome Shotgun project has been deposited at DDBJ/ENA/GenBank under the accession JASDQU000000000. The version described in this article is version JASDQU010000000. The genome annotation is available on Zenodo (Lo et al. 2023). The mitochondrial genome assemblies for P. mariana (Marr et al. 2023), P. engelmannii (Coombe et al. 2023a), and P. glauca (Coombe et al. 2023b) are also available on Zenodo.
Supplemental material available at G3 online.
Funding
This project was supported through a Lakehead University research grant led by A. Thomson, the CanSeq150 Initiative led by A. Thomson and S.J.M. Jones, and the Spruce-Up project co-led by J. Bohlmann and J. Bousquet with funding from Genome Canada, Genome British Columbia and Genome Quebec (grant number 243FOR).
Literature cited
Author notes
Conflicts of interest statement The author(s) declare no conflicts of interest.