Genomic sequencing and microsatellite marker development for Boswellia papyrifera, an economically important but threatened tree native to dry tropical forests

The world famous aromatic resin Frankincense is tapped from natural populations of Boswellia trees. Most of these populations have been shrinking rapidly over recent decades. To help guide conservation efforts for imperilled species of this genus, we developed 46 genetic markers for Boswellia papyrifera. Several of these were cross-transferable to other Boswellia species that occur in Ethiopia and Yemen. We also identified genes involved in the biosynthesis of the terpenes and terpenoids that are major constituents of frankincense.


Introduction
To implement an effective conservation programme, it is essential to understand the genetic structure of endangered populations and the dynamics of genetic variation over space and time (Karp et al. 1997;Burczyk et al. 2006;Gonzá lez-Martínez et al. 2006;Frankham et al. 2010;Nybom et al. 2014). Microsatellite or simple sequence repeat (SSR) markers have been widely applied in quantifying the level of genetic variation and its spatial organization, describing the demography and history of populations, and analysing the gene flow and parentage in plants and animals (e.g. Arens et al. 2007;Smulders et al. 2008;Primmer 2009;Allan and Max 2010). These repeats are abundant in the genome, polymorphic and multiallelic (thus highly informative), have co-dominant inheritance (allowing a direct measurement of heterozygosity), and markers based on them are frequently transferable across related species (Chase et al. 1996;Smulders et al. 1997Smulders et al. , 2001Brondani et al. 1998;Pastorelli et al. 2003;Tuskan et al. 2004;Selkoe and Toone 2006;Allan and Max 2010;Fan et al. 2013).
Recently, next-generation sequencing technologies have simplified generating large amounts of sequences at affordable cost, thus facilitating the development of molecular markers, including SSRs and single-nucleotide polymorphisms (SNPs) Ekblom and Galindo 2011;Castoe et al. 2012;Smulders et al. 2012;Lance et al. 2013;Vukosavljev et al. 2015), as well as chloroplast sequences for phylogeographical studies ( Van der Merwe et al. 2014). The development of markers has thus become feasible also for species for which no prior sequence information exists (Smulders et al. 2012), including understudied but economically important crops (Zalapa et al. 2012).
Marker development can be based on short-length sequences from genomic DNA sequences or cDNA (RNAseq). Both sets of reads are useful, but they differ with regard to further data mining. RNA-seq data can be de novo assembled into a (partial) transcriptome (Yang and Smith 2013) with some caveats, partly related to the assembler used . A common denominator appears to be that multiple assemblers need to be compared (Nakasugi et al. 2014), but the final result can be compared with the transcriptome of other species. In contrast, it is not straightforward (Vicedomini et al. 2013) to assess the quality of a de novo assembly of short reads of genomic DNA from a species for which no prior sequence information is available, especially if the genome is large and contains many repeats, and the species is heterozygous or even polyploid. Nevertheless, many studies are based on genomic DNA, as it is easier to extract DNA from dry material of wild species collected in the field (on silica gel) than to try to extract good quality RNA from fresh samples or from samples specifically prepared for RNA extraction. What additional information can be reliably extracted from a single library of short reads of genomic DNA is an open question.
Boswellia papyrifera is currently the number one frankincense-producing tree species in the world (Coppen 2005). Frankincense is an aromatic resinous gum exudate produced from the bark of trees. Its economic value in the world market stems from its use as an ingredient in pharmaceuticals, cosmetics and as a church incense (Groom 1981;Tucker 1986;Lemenih and Teketay 2003). In Ethiopia, besides its value in the national economy, it has a significant contribution in the local livelihoods, providing up to one-third of annual household income, especially in the northern regions of the country (Lemenih et al. , 2007Woldeamanuel 2011).
The population size of B. papyrifera is declining in Ethiopia (Abiyu et al. 2010;Groenendijk et al. 2012;Tolera et al. 2013), Eritrea (Ogbazghi et al. 2006) and Sudan (Abtew et al. 2012). Little or no tree regeneration occurs in its natural range and mortality of adult trees increases. Despite its endangered status and economic importance, very few conservation efforts exist and none are supported by genetic information. The later situation results because genetic markers for the species have not been developed.
In the present study, we applied the Illumina pairedend sequencing technology to sequence genomic DNA of B. papyrifera with the goal of identifying microsatellite repeats and developing SSR markers. The reads were also assembled into the first genomic resource for this species, and we present a couple of structural and functional analyses on them.

Plant material
Boswellia papyrifera is one of the six Boswellia species that grow in various parts of Ethiopia. The B. papyrifera genotype used for Illumina paired-end sequencing was collected from a natural population at Kafa Humera Wuhdet (14.05265N latitude; 37.13078E longitude) in northwest Ethiopia. Young leaves were collected from growing shoot tips of the plant and preserved in silica gel while in the field and during transportation to the laboratory for DNA extraction. A genomic DNA library for Illumina paired-end sequencing was prepared from 4 mg of DNA following the PCR-based gel-free illumina TruSeq DNA sample prep protocol and sequenced as 2 × 100 nt paired-end reads on an Illumina HiSeq at Greenomics, Wageningen UR, Wageningen, the Netherlands.

Plant material for SSR marker development
For testing of the SSR loci a set of 12 genotypes were used. Ten of the genotypes represented populations of B. papyrifera collected from 10 different regions of Ethiopia. Two genotypes of Boswellia pirrotae and B. popovina were included for testing the cross-transferability of the markers to closely related species. The B. pirrotae sample was from the northwestern part of Ethiopia. Boswellia popoviana is endemic to Socotra Island, Yemen, and the dried leaf sample was obtained through the Edinburgh Royal Botanical Garden, UK.

DNA extraction
Total DNA was isolated from silica-dried young leaves following the cetyltrimethylammonium bromide protocol of Fulton et al. (1995). As large amounts of phenolic compounds were expected because of the resin content in the leaves, the protocol was modified by the addition of 2 % pvp-40 to the extraction buffer and 1 % mercaptoethanol to the microprep buffer of Fulton et al. (1995), added immediately before use. The extraction was followed by purification steps using DNeasy (Qiagen, Venlo, The Netherlands) according to Smulders et al. (2010). DNA yield and quality were visually assessed on a 1 % agarose gel.

Sequence filtering
The raw reads were error-corrected using musket (Liu et al. 2012b). This error-corrected set was used for the repeat assembly. Prinseq-lite 0.20.04 (Schmieder and Edwards 2011) was used for quality control and filtering of reads (minimum read length of 50 nt, minimum average base quality of 25, maximum ambiguous nt (N) of 1) after which the data were used for SSR mining. After low complexity trimming (minimum DUST score of 7 for removal of low complexity reads and removal of duplicate reads, also with Prinseq-lite), paired-end reads with overlapping sequences were connected using connecting overlapped pair-end (Liu et al. 2012a) in the full mode. Reads were filtered for chloroplast sequences by mapping the reads against the closest chloroplast genome available, which is one of Citrus sinensis, using bowtie2 (Langmead and Salzberg 2012, settings -D 20 -R 3 -N 1 -L 20 -i S,1,0.50 -a).

Repeat analysis
Reads from the highly repeated fraction of the genome were extracted and assembled using RepARK (REPetitive motif detection by Assembly of Repetitive k-mers; Koch et al. 2014). The motifs present in the repetitive contigs were counted and analysed by blastn (e-value 1e-5) against Repbase v19.08 (database of repetitive DNA elements, Jurka et al. 2005).

Assembly and annotation
A de novo draft assembly was created from the filtered reads using SOAPdenovo 2.21 (Li et al. 2010, settings -K 41 -M 3 -d 4). The gaps emerging during the scaffolding process by SOAPdenovo were closed using GapCloser (vs. 1.12). The contigs .1000 bp of the draft assembly were analysed and functionally annotated using Blas-t2GO (Conesa et al. 2005).

SSR mining and design of primers
Five million of the filtered but not assembled reads were analysed with PAL_FINDER 0.02.03 (Castoe et al. 2012) to identify SSRs using slightly adjusted criteria: at least six contiguous repeat units for dinucleotide repeats, four for tri-and tetranucleotide repeats and three for pentaand hexanucleotide repeats (Castoe et al. (2012) used six units for trunicleotide repeats). Following Castoe et al. (2012) the reads with multiple SSR loci were considered a 'compound' repeat if the SSRs had a different repeat motif, but a 'broken' repeat if the SSRs had the same motif. Reverse-complement repeat motifs (e.g. TG and CA) and translated or shifted motifs (e.g. TGG, GTG and GGT) were grouped together, so that there were a total of four unique dinucleotide repeats, 10 unique trinucleotide repeats and so on.
A subset of over 70 000 trinucleotide to hexanucleotide repeat-containing reads was used to further screen potentially amplifiable SSR loci (PALs): loci for which PCR primers could be designed. Primer designing followed the default parameters specified in Primer3 (Rozen and Skaletsky 2000). The reads were then screened for differences in lengths of those sequences that contained these primers (as in Vukosavljev et al. 2015). At these loci the sequenced plant may be heterozygous, thus indicating that the locus is polymorphic. These formed the group of potentially polymorphic loci.

SSR loci amplification and analyses of polymorphism
PCRs were performed in a total volume of 10 mL reaction mixture containing 4 mL 2 ng mL 21 DNA, 5 mL MP mix from Qiagen kit, 0.8 mL (2 mM) universal fluorescent-labelled primer and 0.2 mL mix of the forward and reverse primers. The fluorescent labelling method described in Schuelke (2000) was adapted to label the primers for analyses of the PCR products with a laser detection system. For this the forward primers were labelled with a universal M13 sequence (AA CAGGTATGACCATGA) at the 5 ′ end while the reverse primers were tailed with GTTT at their 5 ′ end according to Brownstein et al. (1996) to reduce stutter bands (both tailing sequences are not shown in the sequences in Table 1). A thermal cycling profile was set at 15 min of initial denaturation at 95 8C, followed by 30 cycles of 30 s denaturation at 94 8C, 45 s annealing at 56 8C and 45 s extension at 72 8C. This was followed by additional eight cycles with 53 8C annealing temperature to facilitate the annealing of the fluorescent dye-labelled M13 primer, and a final extension step of 10 min at 72 8C. After amplification 10 mL water was added. Fluorescently labelled amplicons were resolved on a 4200 or 4300 Licor DNA analyser.

Next-generation sequencing
Genomic DNA of one B. papyrifera individual was sequenced in order to obtain a library to mine for microsatellite repeats. One lane on an Illumina HiSeq produced 143 458 368 raw reads. Based on k-mer counts, the estimated genome size of B. papyrifera was 705 Mb, sequenced at 36× coverage. After error correction and filtering reads  Table 1. Forty-six polymorphic microsatellite markers developed for B. papyrifera and their cross-transferability to B. pirrotae and B. popoviana. 1 A ¼ number of alleles in 10 B. papyrifera genotypes. 2 Ho ¼ observed heterozygosity (a tentative figure, as the 10 individuals are from 10 different populations. 3 Amplification was also tested in one individual of B. pirrotae (Br) and one of B. popoviana (Bv) except where no Bv is indicated. Hom ¼ homozygous and Het ¼ heterozygous, always with products in the same size range as the alleles in B. papyrifera, except where noted that they were out of range. No ampl ¼ no amplification.

Name
Primer sequence (

SSR identification
A search of SSRs in a subset of five million Illumina paired-end reads identified 170 832 reads (3.4 %) containing SSRs. In these reads, a total of 175 607 repeat loci (dinucleotide through hexanucleotide repeats) were identified, which corresponds to one SSR locus per 5.7 kb. Figure 1 shows the frequency of the top-20 repeat motifs. These include all dinucleotide motif repeats (of at least six repeat units long), of which AC and AT repeats were the most abundant. Of the trinucleotide repeats (of at least four repeat units) AAT and AAC were the most frequent, followed by TTC. Excluding the dinucleotide repeats, the remaining 70 415 SSR loci were screened for the presence of sufficient forward and reverse flanking sequences suitable to design primers. This yielded 29 886 (42 %) PALs. Further filtering of these PALs by applying the most stringent criteria aimed at selecting single-copy loci yielded 4071 potentially amplifiable SSR loci.

Polymorphism and amplification of SSR loci
A total of 136 SSR loci (117 randomly picked and 19 loci predicted to be potentially polymorphic as they appeared to have two different alleles in the sequence reads) were tested for amplification and degree of polymorphism in 10 randomly chosen individuals from different populations. Of the 117 randomly picked loci, 82 primer pairs amplified a high-quality PCR product, of which 37 (45 %) were polymorphic with a banding pattern that could be scored clearly (Table 1). Of the 19 primer pairs predicted to be polymorphic, 13 amplified bands of which 9 loci (69 %) were polymorphic, indicating a significantly higher rate of polymorphism (x 2 test, P , 0.005) compared with randomly picked loci. The final set of 46 markers included 30 trinucleotide repeat markers, 4 tetranucleotide repeats, 6 pentanucleotides and 6 hexanucleotide repeats. The number of alleles across the polymorphic loci varied between 2 and 12 with an average value of 4.8 alleles in 10 genotypes. Several of the polymorphic markers with 10 -12 alleles were TTC repeats. The heterozygosity per locus ranged widely from 0.10 to 0.89 (average 0.43). It is possible that, when used in larger populations, these markers will show higher estimates of Ho, and additional alleles may be found.
As shown in Table 1, most of the SSRs successfully amplified in B. pirrotae and B. popoviana (in the latter species the amount of DNA was insufficient to test all markers). Amplification, even if it is in the same size range as the alleles in B. papyrifera, is not proof that the marker is

Sequence assembly and annotation
The Illumina reads are the first genomic resource generated in the genus Boswellia. The repeat fraction was assembled based on k-mer frequency. This produced 49 576 contigs of repeats that were present at least 50× (median length 139 bp, mean length 224 bp, N50 238 bp, maximum length 21 153 bp, total sum ¼ 574 Mbp). Next, 1533 contigs had blastn hits with RepBase, mostly with Copia (639 hits) and Gypsy (523) retrotransposons, alongside EnSpm (114), hAT (72), Satellite (29), TY (23), Harbinger (16), YPrime (14), Helitron (12) and SCTRANSP (3). Intermixed with these elements were hits to the ribosomal RNAs (LSU 56 hits, SSU 41) and also to Caulimoviridae viruses (11). Using all data in a de novo assembly with SOAPdenovo, 444 927 contigs were obtained with a median of 375 bp, a mean contig length of 690 bp, an N50 of 1085 bp and a maximum contig size of 19 236 bp (total sum ¼ 307 Mb genomic DNA sequence). The contigs .1000 bp were blasted against Genbank, and 65 467 were annotated with GO terms (Fig. 2; note that these are overlapping classes).

Terpene biosynthesis genes
Assefa et al. (2012) conducted a biophysical and chemical study on resins of Boswellia species with special emphasis on B. papyrifera. Using the list of identified components, eight contigs of the assembly were found, which represent part of genes of the terpene synthesis pathways, namely pinene synthase, limonene synthase (2×), isoprene synthase (4×) and gamma-terpinene synthase.
We also searched for the enzymes that are involved in terpenoid backbone biosynthesis (according to the Kyoto Encyclopedia of Genes and Genomes pathway   Table 2 lists the enzymes of the mevalonate and non-metavolate (MEP/DOXP) pathways, the two pathways for the synthesis of terpenoid building blocks in plants, which were found among the annotation results. Two of the key enzymes of the MEP pathway, 2-C-methyl-D-erythritol-4-phosphate cytidylyltransferase (EC 2.7.7.60) and 4-(cytidine 5 ′ -diphospho)-2-C-methyl-D-erythritol kinase (EC: 2.7.1.148), were not recognized in the set of scaffolds, but reciprocal tBlastx (at 1e-5) against these enzymes identified in Arabidopsis did reveal hits with, respectively, 3 and 2 contigs.

Discussion
We have developed the first set of 46 SSR markers for B. papyrifera. The markers amplified between 2 and 12 alleles in individuals from 10 different populations across Ethiopia. We based the marker development on DNA sequences from one individual. Most of the markers tested were chosen randomly, but the subset for which we assessed, from the sequence reads, that they probably had two alleles in this individual, gave a significantly higher success rate compared with the randomly chosen ones. This assessment is a technically easy screening step that would improve the efficiency of marker development in an outbreeding species, even if only sequences from one individual have been generated, as is often the case. It is probably not as efficient as a strategy that generates transcriptome sequences from multiple individuals with the specific aim of testing only those loci on gel for which polymorphisms in repeat length exist among the reads obtained from these individuals (Vukosavljev et al. 2015).
The SSR markers were developed based on a set of Illumina paired-end DNA sequence reads from young leaves of a single individual of B. papyrifera. The distribution of these reads indicated a genome size of 705 Mb. This is close to the estimate of 682 Mb for B. serrata, the only Boswellia species listed in the Kew Gardens C-value database.
Mobile elements that are present in multiple copies in the genome were analysed based on sequence homology in k-mers that occurred at high frequency (Koch et al. 2014). We have identified a series of retrotransposons, the most common being Copia and Gypsy elements. As these elements are present in large numbers, our Illumina reads probably were a sufficiently good source to determine the presence and relative frequency of various elements.
We also assembled all reads of our paired-end shortread library and obtained 307 Mb of unique sequences. The quality of this assembly is difficult to assess without other independent sources such as libraries of different insert sizes, and we therefore did not compare the results of various assemblers (as, e.g. Shahin et al. 2012 did) or merged assemblies (Vicedomini et al. 2013). Our resource was searched for genes that are expected to be involved in production of the major compounds of the resin, which in B. papyrifera includes diterpenes, triterpenes and nortriterpenes (Basar 2005;Assefa et al. 2012;Bekana et al. 2014). The contigs of our assembly gave significant hits for most genes of the core terpene and terpenoid pathways. We have not carried out an in-depth analysis of the sequences in these contigs, as extracting the complete Boswellia homologues of these genes would need more bioinformatics steps and independent validation, e.g. by PCR and Sanger sequencing. However, the results indicate that for many genes of interest at least partial sequence information is present. Genetic information is one of the several tools that facilitate the management of populations and support efforts to conserve threatened species (Moran 2002;Edwards et al. 2011). The newly developed SSR markers generated here for B. papyrifera can be applied for characterizing the genetic diversity, population structure and processes within populations, such as pollen and seed dispersal distances, information which may assist in identifying conservation units for the species. A study of the population differentiation of B. papyrifera across Ethiopia using a subset of these SSR markers is ongoing (Addisalem et al., in prep.). The cross-amplification and polymorphism of the SSR markers in the other two Boswellia species, B. pirrotae and B. popoviana, indicate their potential use for genetic studies of these species and possibly also in other Boswellia species. Boswellia popoviana is declining and vulnerable in Yemen.
The sequence data generated form the start of a valuable genomic resource for various applications, including estimating the past and present demographic parameters, phylogenetics and phylogeography. With regard to 'conservation genomics ', McMahon et al. (2014) suggested that genomic sequences are particularly suited to study local adaptation. For most of these applications, genomic sequences need to be generated from several individuals from different populations. This would complement genetic differentiation studies with neutral molecular markers such as SSRs. An exception is the estimation of the the effective population size from SNP density data based on the differences between the alleles at many loci of the heterozygous tree (e.g. Halley et al. 2014).

Conclusions
Based on Illumina paired-end sequences, we have developed a set of polymorphic SSR markers for B. papyrifera and two sister species, which will be useful for studying genetic diversity within and differentiation between Boswellia populations. We also generated the first genomic resource in Boswellia.

Accession Numbers
Accession number in ENA/GenBank for the set of DNA sequences on which the SSR markers were developed: ERS403283.

Sources of Funding
This study was funded by the Netherlands' Fellowship programme (NUFFIC).

Conflicts of Interest Statement
None declared.