De novo genome assembly of the red silk cotton tree (Bombax ceiba)

Abstract

Background: Bombax ceiba L. (the red silk cotton tree) is a large deciduous tree that is distributed in tropical and sub-tropical Asia as well as northern Australia. It has great economic and ecological importance, with several applications in industry and traditional medicine in many Asian countries. To facilitate further utilization of this plant resource, we present here the draft genome sequence for B. ceiba.

Findings: We assembled a relatively intact genome of B. ceiba by using PacBio single-molecule sequencing and BioNano optical mapping technologies. The final draft genome is approximately 895 Mb long, with contig and scaffold N50 sizes of 1.0 Mb and 2.06 Mb, respectively.

Conclusions: The high-quality draft genome assembly of B. ceiba will be a valuable resource enabling further genetic improvement and more effective use of this tree species.


Yi Fu
Bin Tian

Lizhou Tang
Response to Reviewers:

Dear editor and reviewers,

The manuscript "De novo genome assembly of the red silk cotton tree (Bombax ceiba)" (GIGA-D-18-00045R1) has been carefully revised according to the reviewers' suggestions. The major revisions are marked in red.
Reviewer #2:

1. For my previous question 1: the authors pointed out that heterozygosity may affect the estimation of genome size by using k-mers. Do the authors believe that a <1% heterozygosity rate can lead to ~100 Mb assembly differences (the final assembly is 895 Mb)? Is it possible that the 17-mer analysis underestimated the genome size? (I understand that in BGI's paper a 17-mer was used to estimate the giant panda's genome size. Is a 17-mer suitable for B. ceiba? If the authors test different k-mers, I suppose the results will differ.)

Answer: As the reviewer suggested, we reanalyzed the genome size with other k-mers (19-mer and 21-mer). The estimated genome sizes were 835 Mb (19-mer) and 821 Mb (21-mer), respectively. These results did not depart dramatically from the genome size estimated with the 17-mer (809 Mb). Our previous flow cytometry study also suggested that the genome size of B. ceiba was approximately 800 Mb (2C = 1.55 ± 0.03 pg) [1]. The heterozygosity rate of the B. ceiba genome was 0.88%. As pointed out by many researchers, genomes with a heterozygosity rate higher than 0.5% are considered highly heterozygous [2]. Assembling highly heterozygous diploid genomes is a substantial challenge, and heterozygous regions that cannot be assembled into a consensus may result in a larger assembly [2]. We therefore concluded that the high heterozygosity of the B. ceiba genome might be the main reason for the ~100 Mb difference between the estimated genome size and the final assembly.
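The size comparison in this answer can be checked with a few lines of arithmetic. The sketch below is illustrative only: it uses the figures quoted above (k-mer estimates, assembly size, 2C value), while the function names and the picogram-to-megabase factor (about 978 Mb per pg, a commonly used conversion) are assumptions added here and are not taken from the manuscript.

```python
# Figures quoted in the response above. MB_PER_PG is an assumption added
# for illustration; it is not taken from the manuscript.
KMER_ESTIMATES_MB = {17: 809, 19: 835, 21: 821}  # k -> estimated genome size (Mb)
ASSEMBLY_MB = 895   # final assembly (scaffold) size in Mb
FLOW_2C_PG = 1.55   # 2C DNA content from flow cytometry (pg)
MB_PER_PG = 978     # assumed conversion: 1 pg of DNA is roughly 978 Mb


def assembly_excess(assembly_mb, estimate_mb):
    """Return (excess in Mb, excess as a fraction of the estimate)."""
    excess = assembly_mb - estimate_mb
    return excess, excess / estimate_mb


def kmer_spread(estimates):
    """Difference between the largest and smallest k-mer estimates (Mb)."""
    values = list(estimates.values())
    return max(values) - min(values)


def haploid_size_from_2c(two_c_pg, mb_per_pg=MB_PER_PG):
    """Convert a 2C DNA content in pg to a haploid (1C) size in Mb."""
    return two_c_pg / 2 * mb_per_pg


if __name__ == "__main__":
    for k, estimate in sorted(KMER_ESTIMATES_MB.items()):
        excess, fraction = assembly_excess(ASSEMBLY_MB, estimate)
        print(f"{k}-mer: estimate {estimate} Mb, assembly excess {excess} Mb ({fraction:.1%})")
    print(f"Spread across k-mer estimates: {kmer_spread(KMER_ESTIMATES_MB)} Mb")
    print(f"1C size from flow cytometry: {haploid_size_from_2c(FLOW_2C_PG):.0f} Mb")
```

Under these assumptions the three k-mer estimates differ from one another by far less than the assembly differs from any of them, which is the pattern the answer attributes to heterozygous regions that were not collapsed into a consensus.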
2. For my previous question 2: I appreciate that the authors used BLASTN to check for contamination. However, shouldn't the authors use a non-plant database instead of a bacterial one? Why did the authors randomly select some contigs (how many?) instead of using all of them? I understand that random selection avoids bias, but since the contamination rate is low (I suppose), there is less chance of selecting a contaminated contig if only a few contigs are drawn from the pool.

Answer: We accepted the reviewer's suggestion and searched all sequences of the genome assembly against the NCBI nucleotide collection (Nt) using BLASTN, with an E-value < 1e-5 and sequence identity > 70%. In total, 2494 significant hits were obtained. The top-hit species were Theobroma cacao and Gossypium species, comprising more than 69% of the hits (1733 hits). Only five hits from four non-plant species (Psyllidae sp., Trioza eugeniae, Diptacus sp., and Dichorragia nesimachus) were detected, which suggested that there was no contamination from non-plant species in the genome of Bombax ceiba. Please see lines 89-95 in the revised manuscript and

New questions:

1. Why did the authors change the final assembled genome size from 869 Mb to 895 Mb without changing any other statistics? Was anything wrong with the previous calculation?

Answer: We are sorry for this error. In the initial submission, we mistakenly reported the contig size (869 Mb) as the assembled genome size. We have therefore changed the size of the final genome assembly to 895 Mb (the scaffold size) in the revised manuscript. We very much appreciate this comment.

Phylogenetic analysis was performed using 172 single-copy orthologous genes from common gene families found by OrthoMCL [34] (Fig. S5).
We codon-aligned each gene family using MUSCLE

hot-dry valleys [7]. According to the neutral theory of molecular evolution [42], the ratio of (Table S15). One gene is homologous to a desiccation protectant protein coding gene (Lea14).

There is a strong association of LEA proteins with abiotic stress tolerance, particularly during

It should be noted that this is only a preliminary analysis of the functions of these genes; further studies would be needed to clarify their roles.

zibethinus went through their WGD events before diverging from their common ancestor (Fig. 2c).

Figure 2