The genome draft of coconut (Cocos nucifera)

Abstract Coconut palm (Cocos nucifera,2n = 32), a member of genus Cocos and family Arecaceae (Palmaceae), is an important tropical fruit and oil crop. Currently, coconut palm is cultivated in 93 countries, including Central and South America, East and West Africa, Southeast Asia and the Pacific Islands, with a total growth area of more than 12 million hectares [1]. Coconut palm is generally classified into 2 main categories: “Tall” (flowering 8–10 years after planting) and “Dwarf” (flowering 4–6 years after planting), based on morphological characteristics and breeding habits. This Palmae species has a long growth period before reproductive years, which hinders conventional breeding progress. In spite of initial successes, improvements made by conventional breeding have been very slow. In the present study, we obtained de novo sequences of the Cocos nucifera genome: a major genomic resource that could be used to facilitate molecular breeding in Cocos nucifera and accelerate the breeding process in this important crop. A total of 419.67 gigabases (Gb) of raw reads were generated by the Illumina HiSeq 2000 platform using a series of paired-end and mate-pair libraries, covering the predicted Cocos nucifera genome length (2.42 Gb, variety “Hainan Tall”) to an estimated ×173.32 read depth. A total scaffold length of 2.20 Gb was generated (N50 = 418 Kb), representing 90.91% of the genome. The coconut genome was predicted to harbor 28 039 protein-coding genes, which is less than in Phoenix dactylifera (PDK30: 28 889), Phoenix dactylifera (DPV01: 41 660), and Elaeis guineensis (EG5: 34 802). BUSCO evaluation demonstrated that the obtained scaffold sequences covered 90.8% of the coconut genome and that the genome annotation was 74.1% complete. Genome annotation results revealed that 72.75% of the coconut genome consisted of transposable elements, of which long-terminal repeat retrotransposons elements (LTRs) accounted for the largest proportion (92.23%). Comparative analysis of the antiporter gene family and ion channel gene families between C. nucifera and Arabidopsis thaliana indicated that significant gene expansion may have occurred in the coconut involving Na+/H+ antiporter, carnitine/acylcarnitine translocase, potassium-dependent sodium-calcium exchanger, and potassium channel genes. Despite its agronomic importance, C. nucifera is still under-studied. In this report, we present a draft genome of C. nucifera and provide genomic information that will facilitate future functional genomics and molecular-assisted breeding in this crop species.

annotation was 74.1% complete. Genome annotation results revealed that 72.75% of the coconut genome consisted of transposable elements, of which long-terminal repeat retrotransposons elements (LTRs) accounted for the largest proportion (92.23%). Comparative analysis of the antiporter gene family and ion channel gene families between C. nucifera and Arabidopsis thaliana indicated that significant gene expansion may have occurred in the coconut involving Na + /H + antiporter, carnitine/acylcarnitine translocase, potassium-dependent sodium-calcium exchanger, and potassium channel genes. Despite its agronomic importance, C. nucifera is still under-studied. In this report, we present a draft genome of C. nucifera and provide genomic information that will facilitate future functional genomics and molecular-assisted breeding in this crop species.

Data Description
Background Coconut palm (Cocos nucifera, 2n = 32), the only species in the genus Cocos in the family Arecaceae, is a tropical oil crop and widely cultivated in tropical regions due to its extensive application in agriculture and industry. Coconut palm is thought to have originated from the Southwest and Western Pacific region (including the Malay Peninsula and Archipelago, New Guinea, and the Bismarck Archipelago). At present, this tropical tree crop is distributed across 93 tropical countries [2], including Central and South America, East and West Africa, Southeast Asia, and the Pacific Islands, and is grown over 12 million hectares of land [1].
In China, coconut palm grows in the subtropical regions-Hainan and Yunnan provinces-as an economic and ornamental plant. Coconut palm is cultivated over approximately 43 000 hectares in Hainan, with the "Hainan Tall" (HAT) variety covering 36 000 hectares [3]. The HAT coconut needs 8-10 years to enter its reproductive stage and has a height of 20-30 meters, with a medium to large sized nut. The HAT cultivar is highly tolerant to salt and drought stress, but sensitive to temperatures below 10 • C. Coconut palm can disseminate through ocean currents: floating nuts sprout and grow naturally upon washing up on beaches. The ability to adapt to a high-salt environment is closely related to this dissemination feature and to these natural growth conditions. The morphological characteristics of the HAT cultivar are shown in Fig. 1. Here, we present the genome sequence of the Hainan Tall coconut and an analysis of the antiporter and ion channel gene families, relevant to salinity tolerance. As draft genome sequences of coconut relatives (e.g., Elaeis guineensis [4] and Phoenix dactylifera [5,6]) have previously been reported, we also performed a comparative analysis between the coconut and these relative species for genome assembly and annotation characteristics.

Sample collection and sequencing strategy
The genomic DNA was extracted from the spear leaf of an individual of the variety "Hainan Tall" coconut (Cocos nucifera L. Taxonomy ID: 13 894; 19 • 33'3"N, 110 • 47'25"E) from the coconut garden of the Coconut Research Institute (Wenchang, Hainan Province, China) by using the CTAB extraction method [7]. Subsequently, 4 paired-end (PE) libraries with insert sizes of 170 bp, 500 bp, 450 bp, and 800 bp and 5 mate-pair (MP) libraries with insert sizes of 2 Kb, 5 Kb, 10 Kb, 20 Kb, and 40 Kb were constructed using the standard procedure provided by Illumina (San Diego, CA, USA). After library preparation and quality control of the DNA samples, template DNA fragments were hybridized to the surface of the flow cells on an Illumina HiSeq2000 sequencer, amplified to form clusters, and then sequenced by following the standard Illumina manual. Finally, we generated 714.67 Gb of raw reads from all constructed libraries. The raw outputs for each sequenced library are summarized in Table 1. Before assembly, the raw reads were pretreated using the following stringent filtering processes via SOAPfilter (v2.2) [8] software: (1) removed reads with 25% low-quality bases (quality scores ≤7); (2) removed reads with N bases more than 1%; (3) discarded reads with adapter contamination and/or polymerase chain reaction duplicates; (4) removed reads with undersized insert sizes. Finally, 419.08 Gb (estimated 173.17 × read depth) of high-quality sequences were obtained for genome assembly.

De novo assembly of short reads of Cocos nucifera
We used 209.38 Gb of clean reads of the short-insert libraries (excluding the 450-bp library) to estimate the coconut genome size by k-mer frequency distribution analysis [8]. The genome size (G) of Cocos nucifera could be estimated by the following formula: where N represents the total of number of reads, L represents the read length, K represents the k-mer value used in the analysis, and K depth refers to the main peak in the k-mer distribution curve. In our calculations, N was 2 049 520 223, L was 100, and K depth was 71 for K = 17. As a result, the Cocos nucifera genome was estimated to be 2.42 gigabases (Gb). K-mer size distribution analysis (Fig. 2) indicated that Cocos nucifera was a diploid species with low heterozygosity and a high proportion of repetitive sequences. The sequencing depth is shown in parentheses, calculated based on a genome size of 2.42 G. Clean data were obtained by filtering raw data with low-quality and duplicate reads. We then assembled the Cocos nucifera genome using the software SOAPdenovo2 (SOAPdenovo2, RRID:SCR 014986) in 3 steps: contig construction, scaffold construction, and gap filling. In the contig construction step, the SOAPdenovo2 was run with the parameters "pregraph -K 63 -R -d 1" to construct de Bruijn graphs from paired-end libraries with insert sizes ranging from 170 to 800 bp. The k-mers from the de Bruijn graphs were then used to form contiguous sequences (contigs) with the parameters "contig -R" by clipping tips, merging bubbles, and removing low-coverage links. In the scaffold construction step, the orders of the contigs were determined by using paired-end and mate-pair information with parameters "map -k 43" and "scaff -F -u". In more detail, SOAPdenovo2 maps the reads from pairedend and mate-pair libraries to contigs based on a hash table (keys are unique k-mers on contigs; values are positions). In such cases, 2 contigs are considered to be linked if the bridging of the contigs is supported by 5 paired-end read pairs or 3 mate-pair read pairs. In the gap filling step, gaps within scaffolds were filled by utilizing KGF [8] v1.06 and GapCloser v1.12-r6 (GapCloser, RRID:SCR 015026) [8] with paired-end libraries (hav-ing an insert size from 170 to 800 bp in cases, where 1 end could be mapped to 1 contig and the other end extended into a gap). To optimize the assembled sequence, Rabbit (a Poisson-based k-mer model software [9]) was used to remove the redundant sequences. A final length of 2.20 Gb for the scaffolds was obtained and used for further analysis, accounting for 90.91% of the predicted genome size and larger than the African oil palm and date palm genomes (Table 2). Meanwhile, the N50 of the obtained contigs was 72.64 Kb and 418.06 Kb for the scaffolds, which have excluding scaffolds of less than 100 bp. The comparison of N50 values for the assembled coconut genome and for the 4 previously published palm genomes Elaeis guineensis [4], Elaeis oleifera [4], Phoenix dactylifera (PDK30) [5], and Phoenix dactylifera (DPV01) [6] is listed in Table 2.

Genome evaluation
The 57 304 unigenes (transcript obtained from 3 different tissues, spear leaves, young leaves, and fruit flesh), as previously reported by Fan et al. [10], were aligned to the assembled genome of Cocos nucifera using BLAT (BLAT, RRID:SCR 011919) [11] with default parameters. The alignment results indicated that the assembled genome of Cocos nucifera covered 96.78% of the expressed unigenes, suggesting that a high level of coverage has been reached for the assembled genome (Table 3).
We also evaluated the level of genome completeness for the assembled sequences by using BUSCO v2.0 (BUSCO, RRID:SCR 015008) [12], which quantitatively assesses genome completeness using evolutionarily informed expectations of gene content from near-universal single-copy orthologs selected from OrthoDB v9 (OrthoDB, RRID:SCR 011980; plant set) [13]. BUSCO analysis showed that 90.8% and 3.4% of the 1440 expected plant genes were identified as complete and fragmented genes, respectively, while 5.8% of genes were considered to be missing from the assembled coconut genome sequence. The comparative results of the BUSCO estimation in the coconut and in the 4 other palm genome sequences indicates that the smallest fraction of missing genes as predicted by BUSCO was found in the coconut genome assembly (Table 4).

Repeat annotation
We combined homology-based annotation and a de novo method to identify transposable elements (TEs) and the tandem repeats in the Cocos nucifera genome. In the homology-based annotation step, TEs were identified by searching against the Repbase library (v20.04) [14] with RepeatMasker (v4.0.5;

Gene prediction
We combined 3 strategies to predict genes in the Cocos nucifera genome: homology-based, de novo, and transcript alignment. For homology-based annotation, the protein sequences of Arabidopsis thaliana [18], Oryza sativa [19], Sorghum bicolor [20], Zea mays [21], Elaeis guineensis, and Phoenix dactylifera (DPV01) were downloaded from each corresponding source (see "Availability of data sources"). The coconut genome was aligned against these downloaded databases using TBLASTN [22] with parameter "-e 1e-5 -F -m 8" and BLAST results were processed by solar (v0.9) with parameter "-aprot 2 genome2 -z" to determine the candidate gene loci. Next, we extracted the genomic sequences of candidate gene loci, along with 1 kb of flanking sequences, and applied GeneWise 2.2.0 (GeneWise, RRID:SCR 015054) [23] to define the intron-exon boundaries. The genes with pre-stop codon or frame-shifts were excluded from further analysis.
For de novo prediction, we randomly selected 1000 fulllength genes (GeneWise score equal to 100, intact structure: start codon, stop codon, perfect intron-exon boundary) from gene models predicted by homology-based methods to train the model parameters for AUGUSTUS 2.5 (Augustus: Gene Prediction, RRID:SCR 008417) [24]. Two software programs, AUGUSTUS 2.5 and GENSCAN (GENSCAN, RRID:SCR 012902) 1.0 [25], were used to do de novo prediction on the repeat-masked genome of Cocos nucifera. Genes with incomplete structure or a protein coding length of less than 150 bp were filtered out.
Subsequently, genes from both homology-based and de novo methods were combined to obtain non-redundant gene sets by using GLEAN [26] with the following parameters: minimum coding sequence length of 150 bp and maximum intron length of 50 kb. Genes were filtered with the same thresholds as were used for homology-based annotation.
For transcriptome-based prediction, RNA-seq data (SRR606452), as previously reported by Fan et al. [10], were mapped onto the coconut genome to identify the splice junctions using the software TopHat v2.1.1 (TopHat, RRID:SCR 013035) [27]. The software Cufflinks v2.2.1 (Cufflinks, RRID:SCR 014597) [28] was then used to assemble transcripts with the aligned reads. The coding potential of these transcripts was identified using a fifth-order Hidden Markov Model, which was estimated with the same gene sets used in AU-GUSTUS training by train GlimmerHMM, an application in the GlimmerHMM package (GlimmerHMM, RRID:SCR 002654) [29]. The transcripts with intact open reading frames (ORFs) were extracted, and the longest transcript was retrieved as a representative of a gene from multiple transcripts on the same locus.
Finally, we merged the GLEAN and the transcriptome result to form a comprehensive gene set using an in-house annotation pipeline with the following steps: first, all-to-all BLASTP analysis of protein sequences was performed between GLEAN results and transcript assemblies, with an E-value cutoff of 1e-10. These transcript assemblies were added to the GLEAN result to form untranslated region (UTRs) or alternative splicing products, depending on whether the coverage and identity of the alignment results reached 0.9 or not. If the transcript assemblies had no BLAST hit with the GLEAN results, these transcript assemblies were added to the final gene set as a novel gene. The protocol for integrating GLEAN and transcriptome data is shown in Fig. 3.

Gene evaluation
The annotation processes identified 28 039 protein-coding genes ( Table 2), which is less than the predicted gene numbers of Phoenix dactylifera (PDK30, 28 889), Phoenix dactylifera (DPV01, 41 660), and Elaeis guineensis (34 802). Meanwhile, the BUSCO evaluation showed that 74.1% and 11.2% of 1440 expected plant genes were identified as complete and fragmented, with 14.7% of genes considered missing in the gene sets. The BUSCO results showed that our gene prediction was more complete than that of Phoenix dactylifera (PDK30) and Elaeis guineensis, but less complete than that of Phoenix dactylifera (DPV01) ( Table 6).

Gene family construction
Protein sequences of 13 angiosperms, including Elaeis guineensis, Phoenix dactylifera (DPV01), Sorghum bicolor, Prunus persica, Solanum tuberosum, Glycine max, Arabidopsis thaliana, Theobroma cacao, Vitis vinifera, Musa acuminata, Carica papaya, Populus trichocarpa, and Amborella trichopoda, were downloaded from each corresponding ftp site (see "Availability of data sources"). For genes with alternative splicing variants, the longest transcripts were selected to represent the gene. The gene numbers of Elaeis guineensis and Phoenix dactylifera (DPV01) were greatly different from the research paper published in 2013 [4,6], because genes of these 2 species were re-predicted using the NCBI Prokaryotic Genome Annotation Pipeline, which seemed to be more reasonable. Similarities between paired sequences   were calculated using BLASTP with an E-value threshold of 1e-5. OrthoMCL (OrthoMCL DB: Ortholog Groups of Protein Sequences, RRID:SCR 007839) [41] was used to identify gene family based on the similarities of the genes and a Markov Chain Clustering (MCL) with default parameters. About 79.80% of Cocos nucifera genes were assigned to 14 411 families, of which 282 families only existed in Cocos nucifera (coconut specific families) ( Table 7). Fig. 4 shows the shared gene families for orthologous genes. There are 544 orthologous families shared by 5 monocot species and 7706 orthologous families shared by all monocot and dicot species, suggesting 544 monocot unique functions shared by 5 monocot species and 7706 ancestral functions in the most recent common ancestor of the angiosperms.

Phylogenetic analysis
We extracted 247 single-copy orthologous genes derived from the gene family analysis step, and then aligned the protein sequences of each family with MUSCLE (v3.8.31; MUSCLE, RRID:SCR 011812) [42]. Next, the protein alignments were converted to corresponding coding sequences (CDS) using an inhouse Perl script. These coding sequences of each single-copy gene family were concatenated to form 1 super gene for each species. The nucleotides at positions 2 (phase 1 site) and 3 (4 degenerate sites) of codon were extracted separately to construct the phylogenetic tree by PhyML 3.0 (PhyML, RRID:SCR 014629) [43] using a HKY85 substitution model and a gamma distribution across sites. The tree constructed by phase 1 sites was consistent with the tree constructed by 4 degenerate sites.

Divergence time
The Bayesian relaxed molecular clock approach was used to estimate species divergence time using MCMCTREE in PAML (PAML, RRID:SCR 014932) [44], based on the 4 degenerate sites and the data set used in phylogenetic analysis, with previously published calibration times (divergence between Arabidopsis thaliana and Carica papaya was 54-90 Mya, divergence between Arabidopsis thaliana and Populus trichocarpa was 100-120 Mya) [45]. The divergence time between coconut and oil palm is about 46.0 Mya (25.4-83.3 Mya) (Fig. 5), which is less than the divergence time between coconut and date palm.

Identification of antiporter genes in coconut genome
Antiporters are transmembrane proteins involved in the exchange of substances within and outside the membrane. In Arabidopsis, the functions of antiporter genes have been well characterized experimentally, and this gene family was subdivided into 13 different functional groups. Among them, 3 functional clusters were involved in Na + /H + antiporters, some of which were documented to be associated with salt tolerance [46,47]. The amino acid sequences of 70 antiporter genes of Arabidopsis were downloaded from the Arabidopsis Information Resource TAIR website (TAIR, RRID:SCR 004618) [48] and used as queries for BLASTP against the predicted proteins in the Cocos nucifera genome with a cut-off E-value of 1e-10. A total of 126 antiporter genes were identified in coconut genome. Using local Hidden Markov Model-based HMMER (v3.0) searches and the Pfam database, 7 antiporter genes were excluded from further analysis because of the lack of conserved domain. The detailed information of the 119 antiporter genes is listed in Additional file 1.
In order to elucidate the evolutionary relationship and potential functions of the antiporters identified in the study, we applied phylogenetic analysis of Arabidopsis and C. nucifera antiporter proteins using the neighbor joining method (Fig. 6). Phylogenetic analysis showed that the 119 antiporter genes from C. nucifera can be subdivided into 12 groups and that almost all antiporter genes were clustered together with the functional groups in Arabidopsis thaliana.
Phylogenetic analysis showed that the number of antiporter genes was equal between Arabidopsis thaliana and C. nucifera for most groups, except for G1 (1 of 3 Na + /H + antiporter family), G3 (carnitine/acylcarnitine translocase family), and G12 (potassium-dependent sodium-calcium exchanger). The G1 group (1 of 3 Na + /H + antiporter families) contained only 1 Arabidopsis antiporter gene and but 14 C. nucifera antiporters (1-At/14-Cn), whereas G3 (carnitine/acylcarnitine translocase family) contained 1-At/29-Cn, and G13 (potassium-dependent sodium-calcium exchanger) contained 3-At/11-Cn. The Na + /H + antiporter family had been reported to be associated with salt stress. The expansion of the Na + /H + antiporter gene family in the coconut palm may be associated with the high salt tolerance of coconut. Meanwhile, carnitine/acylcarnitine translocase is involved in fatty acid transport across the mitochondrial membranes. This gene family expansion may be associated with accumulation of fatty acid in coconut pulp. Moreover, coconut water contains a high density of potassium ion, approximately 312 mg potassium ion per 100 g of coconut water [49]. In this study, the gene number of potassium-dependent sodiumcalcium exchangers was also detected to be significantly increased compared to Arabidopsis.

Identification of ion channel genes in coconut genome
A total of 67 ion channel genes were identified in the coconut genome (Additional file 2). The amino acid sequences of 67 C. nucifera and 60 Arabidopsis ion channel genes were used to analyze their evolutionary relationship (Fig. 7). Almost all ion channel genes from C. nucifera can be clustered into the function groups found in Arabidopsis thaliana. The number of ion channel genes was equal between Arabidopsis thaliana and Cocos nucifera in most groups except for G5 (potassium channel). Many more genes (21) from C. nucifera than from Arabidopsis thaliana (9 genes) were present in group 5 (potassium channels). The gene family expansion may be associated with the accumulation of potassium ions in coconut water.

Conclusion
Cocos nucifera (2n = 32) is an important tropical crop, and it is also used as an ornamental plant in the tropics. In the present study, we sequenced and de novo assembled the coconut  genome. A total scaffold length of 2.2 Gb was generated, with scaffold N50 of 418 Kb. The divergence time of Cocos nucifera and Elaeis guineensis is more recent than that of Cocos nucifera and Phoenix dactylifera, suggesting a closer relationship between C. nucifera and E. guineensis. Comparative analysis of antiporter and ion channels between C. nucifera and Arabidopsis thaliana showed significant gene family expansions, maybe involving Na + /H + antiporters, carnitine/acylcarnitine translocases, potassium-dependent sodium-calcium exchangers, and potassium channels. The expansion of these gene families may be associated with adaptation to salt stress, accumulation of fatty acid in coconut pulp, and potassium ions in coconut water. The data output of the coconut genome will provide a valuable resource and reference information for the development of highdensity molecular makers, construction of high-density linkage maps, detection of quantitative trait loci, genome-wide association mapping, and molecular breeding.