A draft genome assembly of halophyte Suaeda aralocaspica, a plant that performs C4 photosynthesis within individual cells

Abstract Background The halophyte Suaeda aralocaspica performs complete C4 photosynthesis within individual cells (SCC4), which is distinct from typical C4 plants, which require the collaboration of 2 types of photosynthetic cells. However, despite SCC4 plants having features that are valuable in engineering higher photosynthetic efficiencies in agriculturally important C3 species such as rice, there are no reported sequenced SCC4 plant genomes, limiting our understanding of the mechanisms involved in, and evolution of, SCC4 photosynthesis. Findings Using Illumina and Pacific Biosciences sequencing platforms, we generated ∼202 Gb of clean genomic DNA sequences having a 433-fold coverage based on the 467 Mb estimated genome size of S. aralocaspica. The final genome assembly was 452 Mb, consisting of 4,033 scaffolds, with a scaffold N50 length of 1.83 Mb. We annotated 29,604 protein-coding genes using Evidence Modeler based on the gene information from ab initio predictions, homology levels with known genes, and RNA sequencing–based transcriptome evidence. We also annotated noncoding genes, including 1,651 long noncoding RNAs, 21 microRNAs, 382 transfer RNAs, 88 small nuclear RNAs, and 325 ribosomal RNAs. A complete (circular with no gaps) chloroplast genome of S. aralocaspica 146,654 bp in length was also assembled. Conclusions We have presented the genome sequence of the SCC4 plant S. aralocaspica. Knowledge of the genome of S. aralocaspica should increase our understanding of the evolution of SCC4 photosynthesis and contribute to the engineering of C4 photosynthesis into economically important C3 crops.

approach with gap filling using low coverage PacBio data. This resource will be useful for the comparative genomics and C₄ research communities. I have some concerns I feel should be addressed before this manuscript is suitable for publication.

Major:
Were the gaps that were filled by PacBio data polished? Raw PacBio reads have an error rate of ~5-10% and these patched regions would be full of errors if there was no polishing. The Raw PacBio reads could be polished against themselves prior to gap filling with PBJelly or the regions could be polished with Illumina data using a program such as Pilon.
Response: We appreciate the reviewer's comment. Sorry that we missed this important information in the last version of the manuscript. The details have been added to the revised manuscript (lines 121-124). In this study, raw PacBio reads were polished with Illumina data using Proovread (https://github.com/BioInf-Wuerzburg/proovread) to ensure the high accuracy of the PacBio reads used for genome assembly.
Minor: Figure 2 is not particularly informative and slightly misleading. Circos plots of chromosomes/pseudomolecules can be useful for looking at gene and repeat density as well as macrosynteny (for polyploids and for detecting whole genome duplication events). Here, the 4,033 contigs are stitched together into one continuous block and various features are highlighted. Because most scaffolds are relatively small, rolling windows of features seem meaningless (particularly track i).
Response: We concur with the reviewer's comment, and decided to remove this figure from the revised manuscript accordingly.
Line 205: more details on what MAKER parameters were used are needed, particularly which gene prediction programs were used (i.e. Snap, Augustus, etc). The BUSCO completeness of the annotation should also be provided.
Response: Thanks for the reviewer's comment, and detailed MAKER parameters were added to revised manuscript (lines 224-231) accordingly. De novo predictions were processed by AUGUSTUS (v3.2.1). The BUSCO completeness of the annotation was provided at Lines 212-214.
I would caution the authors against using 'first' in the title. To my knowledge, this is the first plant with single cell C₄ to be sequenced, but that might not be the case when this paper is published.
Response: Thanks for this specific comment. The title was changed to: "A draft genome assembly of halophyte Suaeda aralocaspica, a plant that performs C₄ photosynthesis within individual cells".
Reviewer #2: Gigascience 4.2019 Here, the authors present the wet lab work, some bioinformatics information and short analysis of the genome of S. aralocaspica, the first SCC₄ genome available. This is a short technical / data report of the public availability of a genome. I appreciate the simplicity this type of article; however, I have a number of scientific and data presentation comments. These are detailed below.
Genome analysis: The genome analyses (phylogenetics, gene families) leave much to be desired. They are not well described, interpretation is nearly completely lacking and the scope of the analyses are very narrow. I understand that this is a data note; however, if such analyses are going to be conducted, they should be done in a rigorous manner, with sufficient replication, and include a complete methodological description and scientific interpretation. What does 'SCC₄ would independently evolve in this family' mean? Is it that the trees indicate that SCC₄ is a derived trait in the family? If so, to make this claim, you would need to analyze gene / species trees with many more species. It could easily be that the observed pattern is simply due to very small sample size. Same goes for the private-gene family analysis. If the authors want to actually perform this analysis they need to add many more genomes. This is in part to capture the evolution of these families, and in part because sequence alignments of highly diverged and sparsely sampled sequences are highly biased and inaccurate.
Response: Many thanks for your critical comments. We have re-performed the comparative genome analysis, using all annotated protein sequences from 18 sequenced plant genomes including eight C₃ species, eight C₄ species and two CAM species. The analysis We also rewrote the result presentation under the subtitle of "Phylogenetic placement of S. aralocaspica" accordingly. In fact, we tried to use more species, however, which led to no single copy gene for analysis. In the revised version, we also added a couple of interpretation and expanded the scope of the analyses.
Text: There are a lot of proofreading and grammar errors. The most obvious is line 282: 'We annotated xxx protein-coding genes and noncoding genes' … This type of error is very concerning and makes me skeptical that the authors put in the effort to check accuracy of the numbers presented throughout. Response: To avoid this type of errors, we have sent our revised manuscript to a commercial copy editor in International Science Editing (http://www.internationalscienceediting.com) to improve the language.
The title is misleading and makes the reader think that you have sequenced a unicellular organism. Response: Thanks for your comment. We have change the title to : "A draft genome assembly of halophyte Suaeda aralocaspica, a plant that performs C₄ photosynthesis within individual cells" to avoid the misunderstanding.
The term 'high quality' is used a lot. To me, this is not helpful. High-quality means different things to different people. The genome presented here is of low or moderate quality to those working in model systems, or those with recently sequenced genomes. However, it is certainly high enough quality for a species with little genomic information. I would drop this term throughout and instead focus on the exact contiguity. Response: Thanks for you for your comments. We have dropped this term throughout..

Methods:
The methodological details focus on lab protocols, but have very little information about sample sizes, plant growth, etc. for example: How were plants grown for RNA extraction? How many biological replicates? N. technical replicates? Was RNA extracted from the exact same plant as gDNA? If not, was it the same genotype? If so, how was it clonally propagated? All of these questions and many others must be answered. Please provide georeferenced location information for the sample collection site. Please provide the collection permits and all other information about how the sample was obtained.
Response: Thanks for you for your comments. We have added the suggested information as a separated part called "Plant materials" at line 92 in revised manuscript.
How were PacBio reads error corrected?
Response: The raw reads with length < 1kb were filtered and then 46 Gb of Illumina clean reads with 100 bp read length was used to correct the PacBio raw reads using Proovread (https://github.com/BioInf-Wuerzburg/proovread). The detailed method were added to the revised manuscript (line 121).
How is % gap calculated? on a per-bp basis? Line 126. If so, how large are the gaps (10k Ns?) Response: A "gap" is a run of Ns of any length, % gap was calculated using total length of Ns divided to the total length of assembly. In the final assembly, there are 35949 Gaps, the total length of Gaps is 13456548, in average each gap is 374 Ns. The results were added to the revised manuscript (line 148). Mechanistically, the spatially separated chloroplasts in SCC4 contain different sets of nuclear-62 encoded proteins that are related to specific functions in the C4 and C3 cycles, which 63 biochemically and functionally resemble mesophyll and bundle sheath cells in chloroplasts of 64 Kranz C4 plant species [10,11,[15][16][17][18]. These findings indicate that the key enzymes in 4 / 29 photosynthesis are conserved and that both C3 and C4 enzymes work in the same cells in SCC4 66 plants during the day time, which is different from both C4 and CAM plants. 67 At present, most knowledge of SCC4 photosynthesis has come from studies of Bienertia 68 sinuspersici, which has two types of chloroplasts distributed in the central and peripheral parts 69 of the cell [16,[18][19][20][21][22][23][24][25][26][27][28][29]. Studies on S. aralocaspica have focused on the germination of dimorphic 70 seeds [30][31][32][33][34]. S. aralocaspica has elongated photosynthetic cells with two types of chloroplasts 71 distributed at the opposite ends of the cell. This is analogous to the Kranz anatomy, but lacks 72 the intervening cell wall [35]. This cellular feature indicates that S. aralocaspica conducts C4 73 and C3 photosynthesis within a single cell, perhaps retaining the photosynthetic characteristics 74 of both C4 and C3 cycles and representing an intermediate model of the evolutionary process 75 from C3 to C4 [35, 36]. S. aralocaspica is a hygro-halophyte that grows in temperate salt deserts 76 with low night temperatures in areas from northeast of the Caspian lowlands east to Mongolia 77 and western China [35]. Therefore, it is essential to sequence the genome of S. aralocaspica, 78 which may aid in studying C4 evolution under stressful growth conditions and in accelerating 79 engineering C4 photosynthesis into C3 crops for adaptation to high saline growth conditions. 80 In the present study, we sequenced the genome of S. aralocaspica collected from a cold 81 desert in the Junggar Basin, Xinjiang, China. Using an integrated assembly strategy that 82 combined shotgun Illumina sequencing and single-molecule real-time sequencing technology 83 from Pacific Biosciences (PacBio), we generated a reference genome assembly of S. 84 aralocaspica using protocols established in other plant species [37][38][39][40]. To our best knowledge, 85 this is the first sequenced SCC4 genome. These genomic resources provide a platform for 86 advancing basic biological research and gene discovery in SCC4 plants, as well as for 87 5 / 29 engineering C4 functional modules into C3 crops to increase yields and to adapt to high-salt 88 conditions. 89 90

Plant material 92
First, seeds were collected from a healthy S. aralocaspica ( Figure 1). The selected plant 93 measured ∼40 cm in height and was located within a natural stand close to 94 Xinjiang Uygur Autonomous Region, China (N 44°14′ latitude, E 87°40′ longitude, 445 m 95 elevation). The seeds were placed in 0.1% potassium permanganate, washed clean with 96 ultrapure water after 5 min, and then spread in sterilized petri dishes. After a week of 30°C 97 shaded culturing, the seeds germinated. After seed germination, leaves were collected as tissue 98 sources for whole-genome sequencing. In addition, six other healthy S. aralocaspica (collected 99 in same location with the plant for seeds collection) were chosen as tissue (mature leaf, stem, 100 root and fruit) sources for RNA sequencing (RNA-seq). The samples were frozen in liquid 101 nitrogen immediately after being collected and then stored at −80°C until DNA/RNA extraction. 102 All the samples were collected with permission from and under the supervision of local forestry 103 bureaus. 104

DNA extraction and genome sequencing 105
Genomic DNA was extracted from leaves using a General AllGen Kit (Tiangen biotech, Beijing, 106 China) according to its manufacturer's instructions. Genomic DNA isolated from S. 107 aralocaspica was used to construct multiple types of libraries, including short insert size (350, 108 500, and 800 bp) libraries, mate-paired (2, 5, 10, and 20 kb) libraries, and PacBio single-  Table 1). 114 To reduce the effects of sequencing errors on the assembly, a series of stringent filtering 115 steps were used during reads generation. We cleaned Illumina reads using the following steps: 116 (1) Cut off adaptors. For the mate-paired library data, reads without Nextera adaptors longer 117 than 10 bp on both end1 and end2 were removed; (2) Remove tail bases with quality score less 118 than 20; (3) Remove reads harboring more than 20% bases with quality scores less than 20; (4) 119 Remove reads with lengths less than 30 nt for DNA-seq; and (5) Remove duplicated paired-120 end reads from DNA-seq that represent potential PCR artefacts. In total, 1,053,309 raw 121 subreads were produced by Pacbio. Then, reads with lengths < 1 kb were filtered, and 935,509 122 reads were retained. Next, 46 Gb of Illumina clean reads with 100-bp read lengths was used to 123 correct the PacBio raw reads using Proovread [41] (v2). This yielded 632,805 corrected PacBio 124 reads. After the quality control and filtering steps, 195 Gb clean Illumina reads and 6.9 clean 125 PacBio reads were retained, resulting in a 433× fold coverage of the genome (Supplemental 126 refers to a sequence with a length of k bp, and each unique k-mer within a genome dataset can 130 be used to determine the discrete probability distributions of all possible k-mers and their 7 / 29 frequencies of occurrence. Genome size can be calculated using the total length of sequencing 132 reads divided by sequencing depth. To estimate the sequencing depth of the S. aralocaspica 133 genome, we counted the copy number of a certain k-mer (e.g., 17-mer) present in the sequence 134 reads and plotted the distribution of the copy numbers. The peak value of the frequency curve 135 represents the overall sequencing depth. We used the algorithm N × (L − K + 1)/D = G, where 136 N represents the total sequence read number, L represents the average length of sequence reads 137 and K represents the k-mer length, which was defined here as 17 bp. G denotes the genome 138 size, and D represents the overall depth estimated from the k-mer distribution. Based on this 139 method, the estimated genome size of S. aralocaspica was 467 Mb (Supplemental Figure 1) 140 and the heterozygosity was 0.16%. 141

Genome assembly 142
The primary assembled genome was generated by SOAPdenovo [43]     was treated with RQ1 DNase (Promega) to remove DNA. The quality and quantity of the 163 purified RNA were determined by measuring the absorbance at 260 nm/280 nm (A260/A280) 164 using smartspec plus (BioRad). RNA integrity was further verified by 1.5% agarose gel 165 electrophoresis. RNAs were then equally mixed for RNA-seq library preparation. 166 Polyadenylated mRNAs were purified and concentrated with oligo(dT)-conjugated magnetic 167 beads (Invitrogen) before directional RNA-seq library preparation. Purified mRNAs were 168 fragmented at 95°C, followed by end repair and 5′ adaptor ligation. Reverse transcription was 169 performed using an RT primer harboring a 3′ adaptor sequence and a randomized hexamer. The 170 cDNAs were purified and amplified, and PCR products corresponding to 200-500 bp were 171 purified, quantified and stored at −80°C before sequencing. Transcriptomic libraries were 172 sequenced using HiSeq X Ten for paired-end 150-nt reads. As a result, we generated 30 Gb of 173 RNA-seq data (Supplemental Table 3). 174 To further annotate transcriptional start and termination sites, we also sequenced cap 175 analysis of gene expression and deep sequencing (CAGE) and polyadenylation site sequencing 176 (PAS) data. In brief, 20 g of total RNA of mature leaves was used for CAGE-seq library 177 preparation. Polyadenylated mRNAs were purified and concentrated with oligo (dT)-178 conjugated magnetic beads (Invitrogen). After treating with FastAP (Invitrogen) for 1 h at 37°C 179 and subsequently with tobacco acid pyrophosphatase (Ambion) for 1 h at 37°C, the decapped 180 full-length mRNA was ligated to the Truseq 5′ RNA adaptor (Illumina) for 1 h at 37°C and 181 purified with oligo (dT)-conjugated magnetic beads (Invitrogen). Following fragmentation at 182 95°C, first-strand cDNA was synthesized using an RT primer harboring the Truseq 3′ adaptor 183 sequence (Illumina) and a randomized hexamer. The cDNAs were purified and amplified using 184 10 / 29 Truseq PCR primers (Illumina), and products corresponding to 200-500 bp were purified, 185 quantified and stored at −80°C until sequencing. CAGE-seq libraries were sequenced with 186 Illumina Nextseq 500 for paired-end 150-nt reads. Finally, 16 Gb of CAGE-seq data were 187 generated (Supplemental Table 3). In addition, 10 g of total RNA of mature leaves was used 188 for PAS-seq library preparation. In brief, polyadenylated mRNAs were purified using oligo 189 (dT)-conjugated magnetic beads (Invitrogen). Purified RNA was fragmented and then reverse 190 transcription was performed using a PAS-RT primer (a modified Truseq 3′ adaptor harboring 191 dT18 and two additional anchor nucleotides at the 3′ terminus). DNA was then synthesized with 192 Terminal-Tagging oligo cDNA using a ScriptSeq™cv2 RNA-Seq Library Preparation Kit 193 (Epicentre). The cDNAs were purified and amplified, and PCR products corresponding to 300-194 500 bp were purified, quantified and stored at −80°C before sequencing. PAS-seq libraries were 195 sequenced with Illumina Nextseq 500 for single-end 300-nt reads. Finally, 28.5 Gb of PAS-seq 196 data were generated (Supplemental Table 3). 197 To annotate microRNA (miRNA), a total of 3 g of mixed total RNA was the template for 198 a small RNA cDNA library preparation using Balancer NGS Library Preparation Kit for 199 small/microRNA (GnomeGen), following the manufacturer's instructions. Briefly, RNAs were 200 ligated to 3′ and 5′ adaptors sequentially, reverse transcribed to cDNA and PCR amplified. The 201 whole library was applied to 10% native PAGE gel electrophoresis, and bands corresponding 202 to miRNA insertions were cut and eluted. After ethanol precipitation and washing, the purified 203 small RNA libraries were quantified using a Qubit Fluorometer (Invitrogen) and stored at 204 −80°C until sequencing. The small RNA library was sequenced with Illumina GA IIx for 33-nt 205 reads. Finally, 4.5 Gb of small RNA data were generated (Supplemental Table 3). 206

Genome quality evaluation 207
Different methods and data were employed to check the completeness of the assembly. Using 208 BWA [46], we found that 87.08%-90.63% of DNA-paired end reads (350, 500, and 800 bp) 209 could be properly mapped to the final assembled genome (Supplemental Table 4, Supplemental 210 Figure 2). We evaluated the completeness of the gene regions in our assembly using BUSCO 211 [47] (v3.0.2). In total, 89.5% of the 1,440 single-copy orthologs presented in the plant lineage 212 was completely identified in the genome (Supplemental Figure 3). 213 Furthermore, Trinity [48] (r20140413p1) was used to assemble the RNA-seq reads 214 sequenced from the mixed S. aralocaspica RNA library into 157,521 unigenes. Then, these 215 unigenes were aligned to the genome assembly by BLASTN with default parameter. We found 216 94.5% of the unigenes could be aligned to the genome assembly, and 76.3% of the unigenes 217 could cover 90% of the sequence length of one scaffold. For unigenes longer than 1 kb, 99.5% 218 of the unigenes could be aligned to the genome assembly, and 92.8% of the unigenes could 219 cover 90% of the sequence length of one scaffold (Supplemental Table 5). 220

Gene and functional annotations 221
The genome of S. aralocaspica was annotated for protein-coding genes (PCGs), repeat 222 elements, non-coding genes and other genomic elements.  6 and 7). Of the annotated PCGs, 97.2% were functionally annotated by 232 the InterPro, GO,KEGG,SwissProt or NR databases (Supplemental Figures 4 and 5,233 Supplemental Table 8), and ~91% were annotated with protein or transcript support 234 (Supplemental Table 9). The transcriptional start and termination sites of most of the annotated 235 genes were supported by sequencing reads from CAGE-seq and PAS-seq (Supplemental 236 Figures 6 and 7). 237 In addition, 1, 651 long noncoding RNAs were predicted following a previously published 238 with option --cut_ga (Supplemental Table 10, Supplemental Figure 8). 242

Repeat annotation 243
To annotate the repeat sequences of the S. aralocaspica genome, a combination of de novo and 244 homology-based approaches was employed [56,57]  and Pennisetum glaucum) and two CAM species (Ananas comosus and Phalaenopsis equestris). 266 The longest proteins encoded by each gene in all species were selected as input for OrthoFinder 267 with default parameters. In total, 19,324 orthogroups, containing at least two genes, were 268 circumscribed, 11,768 of which contained at least one gene from S. aralocaspica (Supplemental 269   Table 13). Of the 29,604 annotated S. aralocaspica genes, 23,112 (89%) were classified into

Assembly of the S. aralocaspica chloroplast genome 295
Using the short insert size (350 bp) data, a complete (circular with no gaps) chloroplast genome 296 of S. aralocaspica was assembled at 146,654 bp in length using NOVOPlasty [70] (v2.7.2). 297 The Rubisco-bis-phosphate oxygenase (RuBP) subunit of C. quinoa (GenBank: KY419706.1) 298 was selected as a seed sequence. An initial gene annotation of the genome was performed using 299 evolution and mechanisms. We anticipate that future studies of S. aralocaspica will greatly 316 facilitate the process of engineering crops, especially C3 species, including rice, with higher 317 photosynthetic efficiencies and saline tolerance.