The developmental transcriptome atlas of the spoon worm Urechis unicinctus (Echiurida: Annelida)

Abstract Background Echiurida is one of the most intriguing major subgroups of annelida because, unlike most other annelids, echiurids lack metameric body segmentation as adults. For this reason, transcriptome analyses from various developmental stages of echiurid species can be of substantial value for understanding precise expression levels and the complex regulatory networks during early and larval development. Results A total of 914 million raw RNA-Seq reads were produced from 14 developmental stages of Urechis unicinctus and were de novo assembled into contigs spanning 63,928,225 bp with an N50 length of 2700 bp. The resulting comprehensive transcriptome database of the early developmental stages of U. unicinctus consists of 20,305 representative functional protein-coding transcripts. Approximately 66% of unigenes were assigned to superphylum-level taxa, including Lophotrochozoa (40%). The completeness of the transcriptome assembly was assessed using benchmarking universal single-copy orthologs; 75.7% of the single-copy orthologs were presented in our transcriptome database. We observed 3 distinct patterns of global transcriptome profiles from 14 developmental stages and identified 12,705 genes that showed dynamic regulation patterns during the differentiation and maturation of U. unicinctus cells. Conclusions We present the first large-scale developmental transcriptome dataset of U. unicinctus and provide a general overview of the dynamics of global gene expression changes during its early developmental stages. The analysis of time-course gene expression data is a first step toward understanding the complex developmental gene regulatory networks in U. unicinctus and will furnish a valuable resource for analyzing the functions of gene repertoires in various developmental phases.


Background
Within the major annelid groups, Echiurida (also called "marine spoon worms") is represented by a morphologically and ontogenetically unique assemblage that includes approximately 165 species, most of which lack segmentation as adults. However, they possess annelid-like morphological and developmental features, including the organization of the larval nervous system [1,2]. They were once considered a separate metazoan phylum. However, reevaluation of morphological and molecular data indicated that Echiurida is nested within Annelida, which represents one of the three major animal phyla with body segmentation [3][4][5][6][7]. In this respect, transcriptome analyses from various developmental stages of echiurid species are of substantial value for understanding precise expression levels and the complex regulatory networks involved in early and larval development. Indeed, data from recently published developmental transcriptomes of other Lophotrochozoans (e.g., Aplysia californica and Platynereis dumerilii) have highlighted insights into molecular mechanisms underlying early development and metamorphosis [8,9].
Urechis unicinctus is an echiuran species that inhabits burrows in soft sediments in intertidal areas (Fig. 1). The Urechis genus may hold important clues to the genetic basis of the evolutionary gain and loss of segmentation due to its nested position within Annelida (i.e., sister to capitellid polychaetes), a lophotrochozoan phylum that is represented by a diverse group of segmented worms [4,7]. However, current knowledge is limited on the molecular mechanisms that underlie the ontogeny of U. unicinctus. The goal of this study is to enhance our understanding of gene expression during embryonic development. Here, we report the transcriptome profiles (generated with the Illumina HiSeq platform) of developing embryos of U. unicinctus. Transcriptome sequencing data assist in the discovery of the roles of genes involved in various embryological and larval development processes. As the first large-scale transcriptomic dataset for U. unicinctus, this resource will help in the validation of development-specific gene features predicted by the genome.

Sample collection, embryo culture, and RNA isolations
Adults of U. unicinctus were collected from intertidal mud flats on the southern coast of South Korea. We extracted eggs and sperms from 1 adult female and 1 male. To obtain U. unicinctus embryos, artificial fertilization was performed by mixing the appropriate ratio of sperms and eggs.
Embryos were reared in artificial seawater (reef crystals from Aquarium Systems, France) in a plastic case at room temperature (18 • C-20 • C). The late trochophore, a typical larval stage in which the intestinal tract is formed, was fed with a microalgae called Isochrysis galbana. Reared embryo samples were collected at each of the following stages: 0 hour (unfertilized egg), 0.5 hours post-fertilization (fertilized egg), polar body cell, 2 cell, 4 cell, 8 cell, 16 cell, 32 cell, blastula, emerged cilia, early trochophore (day 1), middle trochophore (day 2), late trochophore (day 5), and segmentation stage (day 30-45). Diagnostic features for each of the 3 trochophore stages are as follows. The early trochophore is a nonfeeding stage. In the middle trochophore, the gastrointestinal valve opens and the anus appears. In the late trochophore, the longer cilia of the apical tufts are replaced by shorter cilia that cover a greater area, and the prototroch cilia are longer. These developmental stages follow Newby's classification [10].
Total RNA was isolated from the embryos of the above samples using TRIZOL reagent (Invitrogen, Carlsbad, California) following the manufacturer's instructions. The purity and integrity of the total RNA isolated from each embryo sample were examined using a Nanodrop 2000C spectrophotometer (Thermo Scientific, Waltham, Massachusetts) and Bioanalyzer 2100 (Agilent Technologies, Palo Alto, California). Adult images were taken on a Canon EOS 550D, and embryo bright-field images were taken on a Leica DM6 B microscope using differential interference contrast (DIC) optics.

TruSeq Stranded Ribo-Zero library preparation and sequencing
Total RNA concentration was calculated using Quant-IT Ribo-Green (Invitrogen, R11490). To assess the integrity of the total RNA, samples were run on TapeStation RNA screentape (Agilent, 5067-5576). Only high-quality RNA preparations, with a RNA Integrity Number greater than 7.0, were used for RNA library construction. A library was independently prepared with 1 μg of total RNA for each sample using an Illumina TruSeq Stranded Total RNA Sample Prep Kit (Illumina, Inc., San Diego, California). The rRNA in total RNA was depleted using a Ribo-Zero kit. After the rRNA was depleted, the remaining RNA was purified, fragmented, and primed for cDNA synthesis. The cleaved RNA fragments were copied into first-strand cDNA using reverse transcriptase and random hexamers. This was followed by second-strand cDNA synthesis using DNA Polymerase I, RNase H, and dUTP. These cDNA fragments then underwent an end repair process, the addition of a single "A" base, and ligation of the adapters. The products were then purified and enriched with polymerase chain reaction (PCR) to create the final cDNA library. The libraries were quantified using quantitative PCR according to the qPCR Quantification Protocol Guide (KAPA Library Quantification kits for Illumina Sequencing platforms) and qualified using the TapeStation D1000 ScreenTape (Agilent Technologies, Waldbronn, Germany). The resulting samples were sequenced on the Illumina HiSeq 2000 system with a paired-end read with 101 cycles or the Illumina HiSeq 4000 system with a paired-end read with 151 cycles ( Table 1). The experimental procedures and complete assembly pipeline are summarized in Fig. 2.

Transcriptome preprocessing and de novo assembly
After completion of the sequencing run, to obtain high-quality clean reads from the raw data (i.e., removing those containing adapter sequences, poly-N sequences, or low-quality bases), we performed quality-based trimming and filtering using Trimmomatic, version 0.33 (Trimmomatic, RRID:SCR 011848) [11] (Table 1). Before de novo assembly, all clean reads were pooled without normalization of read abundance, even though the use of all merged reads may require progressively increasing assembly time and memory usage in order to obtain a comprehensive reference transcriptome database. The merged reads were used for de novo transcriptome assembly using Trinity, version 2.1.1 (Trinity, RRID:SCR 013048) [12] with default parameters. The resulting assembled transcriptome consisted of 620,490 transcripts with an N50 value of 846 bp ( Table 2). After assembly, open reading frames (ORFs) were predicted using TransDecoder (version 3.0.0) (http://transdecoder.sourceforge.net). To maximize sensitivity for capturing ORFs, all transcripts were aligned against the Uniprot/Swiss-Prot database (http://www.uniprot.org) via BLASTP search with an E-value cutoff of 10 −5 . Next, ORF lengths <100 amino acids were discarded to avoid maintaining transcripts with poor evidence for protein-coding regions. Finally, redundant transcripts with more than 99% sequence identity were removed using CD-HIT (version 4.6.5) [13], producing 60,472 nonredundant ORFs. These sequences span 63,928,225 bp with an N50 length of 2,700 bp.
To quantify expression levels, the reads for each library were mapped independently to the reference U. unicinctus transcriptome sequences using Bowtie, version 2.2.6 (Bowtie, RRID:SCR 005476) [14]; expression levels of these transcripts were estimated with RSEM, version 1.2.26 (RSEM, RRID:SCR 013027) [15]. The unit of expression level is referred to as fragment per kilobase of transcript per million fragments mapped in our analyses.

Annotation
To annotate coding sequences (CDSs), the resulting 60,472 CDSs were compared against the NCBI nonredundant protein (NR) database (downloaded on 11 April 2017) using BLASTP with an E-value cutoff of 10 −10 and the best BLAST hit. About 66% (40,111/60,472) of the CDS were assigned to superphylum-level taxa, including Lophotrochozoa (40%), Deuterostomia (8%), and Panarthropoda (2%) (Fig. 3A), which was to be generally expected. For further analysis, we excluded a number of CDSs (18%; 7,231/40,111) by using sequences derived from nonmetazoan taxa. When there were multiple coding sequences that mapped to the same gene in the NR database, the sequences with the longest CDS were first assigned to that gene. Based on this criterion, we established a comprehensive transcriptome database of 14 early developmental stages of U. unicinctus that comprises 20,305 representative functional protein-coding transcripts. We further assessed the completeness of the U. unicinctus development transcriptome using the program benchmarking universal single-copy orthologs, version 2.0 (BUSCO, RRID:SCR 015008) [16]. A total of 75.9% (230/303 genes) and 75.7% (740/978 genes) of the eukaryote and metazoan single-copy orthologs were identified, respectively (Fig. 3B).

Transcriptome comparisons
To show that gene expression reflects development-specific differentiation and maturation processes, we built expression distance matrices for each developmental stage and constructed a gene expression tree (Fig. 3C). Two major transitions in expression patterns were observed: blastula to emerged cilia and late trochophore to segmentation. These transitions divided the 14 U. unicinctus developmental stages into 3 phases. The oocyte; polar body; fertilized; 2-, 4-, 8-, 16-, 32-cell embryo; and blastula stages make up phase I. The emerged cilia and early-, middle-, and late-trochophore stages make up phase II. The segmentation stage makes up phase III. These 3 distinct phases of global transcriptome profiles covering 14 developmental stages were supported by principal component analysis, which was performed using the "prcomp" function in the "stats" package in R (version 3.2.4) (Fig. 3C). These results suggest that developmental stages are well characterized by our transcription profiles and that the differential gene expression profiles presented in this study will be useful for further study of ontogenic processes at the gene expression level.
In an additional analysis, a gene whose expression level was significantly changed (≥10-fold and false discovery rate adjusted P value ≤ 0.1%) in at least one comparison was defined as a developmentally regulated gene. We identified 12,705 genes that showed dynamic regulation patterns during the differentiation and maturation of U. unicinctus cells (Fig. 4). Note that we used the trimmed mean of M values normalization [17] provided by the edgeR bioconductor package for R for this test.
Although this study presents the first large-scale developmental transcriptome dataset for a developmentally interesting animal group, U. unicinctus (Echiurida), the global landscape of its developmental transcriptome is not yet complete due to the lack of biological replicates and reference genome sequences. In summary, we present the first large-scale, developmental, stage-specific transcriptome dataset for U. unicinctus and provide a general overview of the dynamics of global gene expression changes at different developmental stages. These data will fill an important gap in annelid-wide comparisons of gene expression patterns and will lead to a better understanding of gene repertoires involved in different developmental stages and of complex developmental gene regulatory networks.

Availability of supporting data
All raw sequencing data used for assembly have been deposited in the NCBI database under the accession numbers SRX2999418-SRX2999431, associated with BioProject PRJNA394029. Additional data further supporting the results of this article, including the transcriptome assembly, annotations, and BUSCO results, can be found in the GigaScience repository, GigaDB [18].