LanceletDB: an integrated genome database for lancelet, comparing domain types and combination in orthologues among lancelet and other species

Abstract Lancelet (amphioxus) represents the most basally divergent extant chordate lineage (cephalochordates), which diverged from the other two chordate lineages (urochordates and vertebrates) more than half a billion years ago. Because it occupies a key position in evolution, it is considered one of the best proxies for understanding the chordate ancestral state. Thus, a database with multiple lancelet genomes and gene annotation data, including protein domains, is urgently needed to investigate the loss and gain of domains in orthologues among species, especially ancient domain types (non-vertebrate-specific domains) and novel domain combinations, which can provide new insight into the chordate ancestral state and vertebrate evolution. Here, we present an integrated genome database for lancelet, LanceletDB, which provides reference haploid genome sequence and annotation data for lancelet (Branchiostoma belcheri), including gene models and annotation, protein domain types, gene expression patterns in embryogenesis, different expressed sequence tag sets and alternative polyadenylation (APA) sites profiled by the sequencing APA sites method. In particular, LanceletDB allows comparison of domain types and combinations in orthologues among type species so as to decode the ancient domain types and novel domain combinations during evolution. We also integrated the released diploid lancelet genome annotation data (Branchiostoma floridae) to expand LanceletDB and extend its usefulness. These data are available through the search and analysis page, basic local alignment search tool page and genome browser to provide an integrated display.


Introduction
Chordates comprise three groups: the urochordates (sea squirts), cephalochordates (lancelets) and vertebrates (including the jawless lamprey and hagfish) (1)(2)(3). Lancelet (amphioxus), in particular, represents the most basally divergent living chordate lineage, which diverged from urochordates and vertebrates 550 million years ago and retains a body plan and morphology most similar to fossil Cambrian chordates (1,4). Analyses of the Florida lancelet genome (Branchiostoma floridae) have indicated that this chordate did not undergo the two rounds of whole-genome duplication (2R-WGD) yet shares extensive genomic conservation with vertebrates (5). Thus, lancelet occupies a key evolutionary position and has been widely used in research on cephalochordate biology and chordate evolution (6)(7)(8)(9), especially the origin and evolution of vertebrate adaptive immunity (10)(11)(12). Recently, an active RAG transposon containing ProtoRAG (a prototypic recombination-activating gene, or RAG) was discovered in the lower chordate lancelet. This sequence encodes RAG1-like (RAG1L) and RAG2-like (RAG2L) proteins and meets the structural criteria for the long-sought RAG transposon, illuminating the origins of V(D)J recombination and providing strong evidence in favour of the RAG transposon hypothesis for the origins of jawed vertebrate adaptive immunity (13)(14)(15). Since lancelet has become one of the best proxies for understanding the evolution of chordates and the origin of vertebrates, a database with multiple lancelet genomes and annotation data, including domain types, is urgently needed to investigate the loss and gain of domains in orthologues among species. In particular, we wanted to enable searches for ancient domain types (non-vertebrate-specific domains) and novel domain combinations that arose during evolution from invertebrates to vertebrates, which may provide new insights into the chordate ancestral state.
The JGI Genome Portal is based on the sequenced genome of the Florida lancelet (B. floridae) and does not cover another lancelet species, Branchiostoma belcheri. In particular, it contains few next-generation sequencing (NGS) reads to support the transcript candidates and lacks alternative polyadenylation (APA) sites identified using an NGS-based approach (16). Another cDNA resource, 'Branchiostoma floridae Gene Collection Release 1', mainly contains short cDNAs and 5′- and 3′-expressed sequence tags (ESTs) from five developmental stages of Florida lancelet, and hence its impact is limited (17). We previously produced extensive datasets for B. belcheri, a lancelet distributed widely along the Chinese coast (14,18), to provide additional information on this important evolutionary niche.
Here, we construct and present a web-accessible genomic database, LanceletDB, for two popular lancelet species (B. belcheri and B. floridae). LanceletDB provides integrated biological information for lancelets, along with search and analysis tools to explore these data. These tools allow comparison of domain types and combinations in orthologues so as to decode the ancient domain types (non-vertebrate-specific domains) and domain combinations in chordates. LanceletDB holds a haploid genome sequence for the Chinese lancelet B. belcheri, which was created from the original diploid assembly using the HaploMerger tool reported previously (18,19). HaploMerger, an easy-to-use automated pipeline, can be used to reconstruct allelic relationships in a polymorphic diploid genome assembly and quickly generate a reference haploid assembly. Thus, the haploid assembly adopted in LanceletDB may represent a better reference assembly for B. belcheri, because it maintains better sequence contiguity and continuity and benefits subsequent gene prediction, structural variation detection and other annotation efforts. As a web-based database, LanceletDB provides convenient URL-based retrieval, browsing and presentation of several types of information online, including genome sequences, gene models, gene function and domains in orthologues among type species, gene expression patterns in lancelet embryogenesis, various expressed sequence tag (EST) sets and the APA sites profiled by the previously described NGS-based sequencing APA sites (SAPAS) method (20,21). Additionally, we integrate the released diploid lancelet genome annotation data (B. floridae) to expand LanceletDB and extend its usefulness. These data are available through the search and analysis page, basic local alignment search tool (BLAST) page and genome browser to provide an integrated display of annotation data.

Materials and methods
Generation of haploid genome sequence for B. belcheri
As described in our previous report (14,18), the draft genome of the Chinese amphioxus B. belcheri was sequenced from an individual male. Using the Newbler and Celera assemblers (22)(23)(24), a polymorphic diploid assembly (with 4% heterozygosity) was generated from ∼100× raw shotgun and paired-end reads, comprising 454 FLX titanium reads (∼30×) and Illumina 115-bp mate-pair reads (∼70×; used only for gap filling). The HaploMerger package (18), an automated pipeline, was adopted with default parameters to untangle allelic relationships in the soft-masked diploid assembly and to guide the subsequent creation of the reference haploid assembly.

Gene prediction and functional annotation
The prediction of protein-coding gene models, including the functional annotation of corresponding proteins, was described in our previous publication (14).

APA site annotation
The SAPAS method reported previously (20,21), which sequences the 3′-ends of polyadenylated transcripts at high throughput, was adopted to identify and annotate polyadenylation sites for lancelet. In short, total RNA extracted from the intestinal tissues of Belcheri's lancelets, either challenged with Vibrio anguillarum or unchallenged, was used to prepare SAPAS sequencing libraries. Library preparation followed the procedure described previously (21). After sequencing, the raw SAPAS reads were processed with the previously developed computational pipeline to map poly(A) sites and quantify their usage on a genome scale. The resulting polyadenylation site datasets were used to annotate the APA sites for lancelet, including the APA sites that support the predicted transcript candidates.
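The mapping and quantification step described above can be illustrated with a minimal Python sketch. It is not the published SAPAS pipeline: the function names, the 24-nt clustering window and the toy coordinates are assumptions chosen for illustration only. The sketch groups mapped 3′-end positions on one strand into candidate poly(A) sites and reports the fraction of reads supporting each site.

```python
from collections import Counter


def cluster_cleavage_positions(positions, window=24):
    """Group 3'-end positions into candidate poly(A) sites.

    A position joins the current cluster when it lies within `window`
    nucleotides of the previous position; otherwise a new cluster starts.
    `window` is an illustrative assumption, not a published parameter.
    Returns a list of (representative_position, read_count) tuples, where
    the representative is the most frequent position in the cluster.
    """
    clusters = []
    current = []
    for pos in sorted(positions):
        if current and pos - current[-1] > window:
            clusters.append(current)
            current = []
        current.append(pos)
    if current:
        clusters.append(current)

    sites = []
    for members in clusters:
        counts = Counter(members)
        # Most frequent position; ties resolve to the smallest coordinate.
        representative = min(counts, key=lambda p: (-counts[p], p))
        sites.append((representative, len(members)))
    return sites


def site_usage(sites):
    """Return the fraction of reads supporting each poly(A) site."""
    total = sum(count for _, count in sites)
    return {pos: count / total for pos, count in sites}


# Toy 3'-end coordinates for one hypothetical gene (not real data).
reads = [1000, 1002, 1002, 1005, 1400, 1401, 1401, 1401]
sites = cluster_cleavage_positions(reads)
usage = site_usage(sites)
```

In this toy case the eight reads fall into two clusters of four reads each, so each candidate site receives half of the usage. A real analysis would also need to handle strand, multi-mapped reads and internal priming, which this sketch ignores.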

RNA sequencing
Chinese Belcheri's lancelets at different developmental stages, including oosperm, 4-8 cells, blastula, cap gastrula, cup gastrula, late neurula, 1-gill slit and adult, as well as several adult tissues, were subjected to RNA sequencing (RNA-seq) using the Illumina GAIIx platform or the 454 FLX titanium platform. The obtained raw reads were filtered and mapped to the lancelet reference genome, termed the 'B.belcheri HapV2(v7h2) genome' (

Database and website design
Based on the generated reference haploid assembly and related annotation information, including the annotated genes and proteins, APA sites and additional RNA-seq and EST data, the LanceletDB website was developed with open-source technologies (Figure 1A). The genome sequences, annotated transcript and protein datasets, APA site datasets and RNA-seq data were integrated to facilitate the query and display of genes on our website (Table 1). For example, the search dataset named B.belcheri HapV2 (v7h2) cds includes a total of 35 293 annotated gene models from Chinese B. belcheri. Information on the location, exon-intron structure and expression pattern of genes, poly(A) signals, poly(A) sites and 3′-untranslated regions (3′-UTRs), as well as the corresponding protein annotation, including domains, GO and KEGG, is stored in a relational database using MySQL. Interactive web-based HTML interfaces, combined with Java, Perl and PHP scripts, provide access to the database. GD modules of PHP, Bioperl modules and R modules are used for dynamic and graphical representation.
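To make the idea of a gene-centred relational layout concrete, the following Python sketch builds a small schema and runs one join. It is an illustration only, not the actual LanceletDB schema: every table name, column name and inserted row is hypothetical, and the standard-library sqlite3 module stands in for MySQL so that the example is self-contained.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; all table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE gene (
        gene_id     TEXT PRIMARY KEY,
        scaffold    TEXT NOT NULL,
        start_pos   INTEGER NOT NULL,
        end_pos     INTEGER NOT NULL,
        strand      TEXT NOT NULL,
        description TEXT
    );
    CREATE TABLE domain (
        gene_id     TEXT NOT NULL REFERENCES gene(gene_id),
        interpro_id TEXT NOT NULL,
        dom_start   INTEGER NOT NULL,
        dom_end     INTEGER NOT NULL
    );
    CREATE TABLE polya_site (
        gene_id  TEXT NOT NULL REFERENCES gene(gene_id),
        position INTEGER NOT NULL,
        reads    INTEGER NOT NULL
    );
    """
)

# Placeholder rows (not real LanceletDB records).
conn.execute(
    "INSERT INTO gene VALUES (?, ?, ?, ?, ?, ?)",
    ("gene_0001", "scaffold_1", 100, 5000, "+", "placeholder description"),
)
conn.executemany(
    "INSERT INTO domain VALUES (?, ?, ?, ?)",
    [("gene_0001", "IPR_X2", 150, 290), ("gene_0001", "IPR_X1", 10, 90)],
)
conn.executemany(
    "INSERT INTO polya_site VALUES (?, ?, ?)",
    [("gene_0001", 4950, 12), ("gene_0001", 4990, 30)],
)

# Retrieve the domains of one gene, ordered along the protein.
domain_rows = conn.execute(
    """
    SELECT g.gene_id, d.interpro_id, d.dom_start, d.dom_end
    FROM gene AS g
    JOIN domain AS d ON d.gene_id = g.gene_id
    WHERE g.gene_id = ?
    ORDER BY d.dom_start
    """,
    ("gene_0001",),
).fetchall()

# Count the poly(A) sites annotated for the same gene.
polya_count = conn.execute(
    "SELECT COUNT(*) FROM polya_site WHERE gene_id = ?", ("gene_0001",)
).fetchone()[0]
```

Keying every annotation table on the gene identifier is one simple way to let a single detail page gather location, domain and poly(A) information with a few joins.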

Datasets in LanceletDB
As listed in Table 1, LanceletDB currently contains lancelet datasets covering haploid genome sequences, annotated gene and protein models, APA sites, ESTs and RNA-seq

General organization and access of LanceletDB
The general organization of the LanceletDB website is presented in Figure 1B, and the datasets are available from our LanceletDB web server at http://genome.bucm.edu.cn/lancelet. The complete set of predicted genes can be queried by keyword in the web-query interface, where the coding sequence (CDS) and protein sequence are loaded with annotation information such as gene name and description, GO, KEGG and domain types. The matched poly(A) sites and poly(A) signals are highlighted in the corresponding genome sequence to facilitate manual checks. The graphical user interface dynamically creates graphics to track APA sites and RNA-seq/EST coverage data together with the exon-intron structure of the searched gene model. It also generates two additional output pages: one showing the expression pattern of the corresponding gene model in lancelet embryogenesis, and one comparing the domain types and combinations between the searched lancelet gene model and its orthologues in other species. The generated datasets were integrated into a genome browser database, via the popular genome browser (Gbrowse) (28), to provide an interactive and graphical view of the genome, transcripts, APA sites and transcript annotations on a genome-wide scale.

Searching LanceletDB
Our 'Gene Search' feature is designed to search the complete dataset of predicted genes in LanceletDB and present gene information for a user's genes of interest according to the search tips. Currently, a gene identifier (id) can be used for a precise query, and keywords such as gene name, symbol and simple description can be used for a fuzzy query. Clicking the button labelled 'Example' yields an example keyword suited to the selected dataset (Figure 2A), and the subsets and simple descriptions for the searched dataset are also shown (Figure 2B). A fuzzy search, for example with the keyword 'NLRP' (NLR family, pyrin domain containing), leads to an intermediate page that lists all matched genes in a dynamic table (Figure 2C), from which the gene information can be selectively viewed in a linked detail page (described next). In contrast, searching with a precise keyword gives users direct access to the detail page, which shows various types of information for the gene model and related graphics.

development of lancelet (from 0 hpf to 6 dpf). In general, the expression of the ssr4 gene is high across all major stages of embryogenesis (FPKM remains above 377). Expression first decreases (0-0.5 hpf), then increases (0.5-5 hpf) and reaches a maximum (FPKM 1957.4) at 5 hpf (cap gastrula stage). After a quick decrease at 6 hpf (cup gastrula stage), ssr4 expression appears to be stable (FPKM remains at approximately 500).
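For readers less familiar with the unit, FPKM (fragments per kilobase of transcript per million mapped fragments) normalizes a raw fragment count by transcript length and sequencing depth. The Python sketch below applies this standard definition and then picks the peak stage from a stage-wise profile. The function names and all numbers are toy assumptions for illustration; they are not LanceletDB values and do not reproduce the pipeline used to compute the reported expression levels.

```python
def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Standard FPKM: fragments per kilobase of transcript per million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("transcript length and library size must be positive")
    return fragment_count * 1e9 / (transcript_length_bp * total_mapped_fragments)


def peak_stage(profile):
    """Return (stage, value) for the stage with the highest FPKM."""
    stage = max(profile, key=profile.get)
    return stage, profile[stage]


# Toy numbers: 500 fragments on a 2000-bp transcript,
# in a library of 10 million mapped fragments.
example_value = fpkm(500, 2000, 10_000_000)

# Toy stage-wise profile (hypothetical values, not the ssr4 data).
toy_profile = {"stage_1": 800.0, "stage_2": 400.0, "stage_3": 1900.0, "stage_4": 500.0}
top_stage, top_value = peak_stage(toy_profile)
```

Because FPKM is scaled by both length and depth, values can be compared across genes and across the developmental stages of one experiment, which is how a profile like the one described above is read.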

Domain types and combination of orthologues among species
Under the 'Orthologues among Species' tab in the detailed page (Figure 4), there is another summary of matched orthologues with high homology to the searched lancelet protein model in several representative species, such as fruit fly (Drosophila melanogaster), lancelet (B. floridae), lamprey (Petromyzon marinus), zebrafish (Danio rerio), mouse (Mus musculus) and human (Homo sapiens) (Figure 4, right). In addition, a set of pictures is generated to detail the predicted domains in these orthologues, providing a direct comparison of domain types and locations among orthologues and enabling the investigation of loss and gain of domains, especially novel domain combinations during evolution from invertebrate to vertebrate (Figure 4, left).
Here, we take the lancelet protein model of myd88 (myeloid differentiation primary response gene 88) as an example (accession id 'Bb 172050R' in our LanceletDB). As shown, a set of pictures is created and labelled with the original protein id (available in other public resources). They show not only the length of the different myd88 orthologues among species but also the domain types and locations in each myd88 orthologue. It is clear that the myd88 protein mainly consists of two types of domains in human, mouse, zebrafish and lancelet, including the death domain (InterPro: IPR000488) and another Toll/interleukin-1 receptor homology (TIR)
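The comparison described above amounts to contrasting the domain sets of orthologous proteins. The Python sketch below is a minimal illustration under stated assumptions: the function name, species keys and domain labels are hypothetical placeholders, not LanceletDB records or the site's actual code. For each orthologue it reports which domain types are shared with a reference protein, which are present only in the orthologue and which are absent from it.

```python
def compare_domains(reference, orthologues):
    """Compare domain sets of orthologues against a reference protein.

    reference   -- iterable of domain identifiers for the reference protein
    orthologues -- dict mapping a species label to an iterable of domain identifiers
    Returns a dict: species -> {"shared": [...], "gained": [...], "lost": [...]},
    where "gained" and "lost" are relative to the reference and each list
    is sorted for stable output.
    """
    ref = set(reference)
    report = {}
    for species, domains in orthologues.items():
        dom = set(domains)
        report[species] = {
            "shared": sorted(ref & dom),
            "gained": sorted(dom - ref),
            "lost": sorted(ref - dom),
        }
    return report


# Hypothetical domain labels for illustration only.
lancelet_domains = ["DOMAIN_A", "DOMAIN_B"]
other_species = {
    "species_1": ["DOMAIN_A", "DOMAIN_B"],
    "species_2": ["DOMAIN_A", "DOMAIN_C"],
    "species_3": ["DOMAIN_B"],
}
report = compare_domains(lancelet_domains, other_species)
```

Note that "gained" and "lost" here are only differences relative to the chosen reference; inferring the evolutionary direction of a change requires placing the species on a phylogeny.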

Dynamic and graphical browsing of gene models
Based on the integration of LanceletDB and a genome browser database, via the popular genome browser (Gbrowse) (28), the LanceletDB website provides dynamic browsing of gene models associated with genomes, APA sites and annotations on a genome scale (herein called the 'Genome Browse' feature). Under the 'Browser' tab (Supplementary Figure S4,

Discussion
We present a comprehensive website database for a reference haploid genome and gene models for lancelet, the most basally divergent extant chordate, including B. belcheri and B. floridae. The EST mapping alignment and RNA-seq read mapping coverage data are used to support the actual gene models, especially RNA-seq reads generated from eight lancelet samples corresponding to the major stages of early embryonic development (oosperm, 4-8 cells, blastula, cap gastrula, cup gastrula, late neurula, 1-gill slit and adult). Moreover, based on the developed NGS-dependent 3′-end sequencing strategy, namely, SAPAS (20,21), the generated APA datasets were used to annotate the APA sites for lancelet genes. LanceletDB therefore provides additional experimental support for its APA sites and gene models.
Based on the reference haploid genome assembly generated by the HaploMerger pipeline (18), we predict gene models for B. belcheri and provide functional annotation for protein models, including GO, KEGG and domain annotation. In particular, Inparanoid (29), an algorithm and tool that finds orthologous genes, is used to find the orthologues of lancelet gene models in other species such as human, mouse, zebrafish, lamprey and fruit fly. The protein domains in orthologues were identified using the InterProScan5 package (30). Thus, LanceletDB can provide a direct comparison of domain types and combinations in lancelet orthologues among species, which helps to investigate the loss and gain of domains in orthologues, especially novel domain combinations during evolution from invertebrate to vertebrate.
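As a rough illustration of how domain predictions can be turned into per-protein domain architectures, the Python sketch below parses tab-separated InterProScan 5 output. It is not the code behind LanceletDB. It assumes the documented TSV column order (protein accession in column 1, start and stop in columns 7 and 8, InterPro accession in column 12), which should be checked against the InterProScan version in use; the function names and the toy rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def parse_interproscan_tsv(handle):
    """Collect InterPro hits per protein from InterProScan 5 TSV output.

    Assumes the documented column order: protein accession (column 1),
    start (7), stop (8) and InterPro accession (12).
    Rows without an InterPro accession are skipped.
    Returns dict: protein -> sorted list of (start, stop, interpro_accession).
    """
    hits = defaultdict(set)
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 12 or row[11] in ("", "-"):
            continue
        hits[row[0]].add((int(row[6]), int(row[7]), row[11]))
    return {protein: sorted(found) for protein, found in hits.items()}


def domain_architecture(protein_hits):
    """List InterPro accessions in order of first appearance along the protein."""
    ordered = []
    for _, _, accession in protein_hits:
        if accession not in ordered:
            ordered.append(accession)
    return ordered


# Toy rows with placeholder identifiers (not real InterProScan results).
toy_rows = [
    ["prot_1", "md5", "300", "Pfam", "PF_X", "desc", "20", "100", "1e-10", "T", "01-01-2000", "IPR_X1", "desc"],
    ["prot_1", "md5", "300", "SMART", "SM_X", "desc", "22", "98", "1e-8", "T", "01-01-2000", "IPR_X1", "desc"],
    ["prot_1", "md5", "300", "Pfam", "PF_Y", "desc", "150", "290", "1e-20", "T", "01-01-2000", "IPR_X2", "desc"],
    ["prot_1", "md5", "300", "Coils", "Coil", "-", "5", "15", "-", "T", "01-01-2000", "-", "-"],
]
toy_tsv = "\n".join("\t".join(row) for row in toy_rows)

parsed = parse_interproscan_tsv(io.StringIO(toy_tsv))
architecture = domain_architecture(parsed["prot_1"])
```

Collapsing hits from different member databases onto their InterPro accession keeps one label per domain type, so the resulting ordered lists can be compared directly between orthologues.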
Overall, LanceletDB holds multiple genome sequences and annotation data for the lancelet species, the best proxies for understanding the chordate ancestral state. As a user-friendly website database, LanceletDB will be an increasingly valuable resource for the genome research community, especially for decoding ancient domain types (non-vertebrate-specific domains) and domain combination in the chordates, providing new insights into the chordate ancestral state and vertebrate evolution.

Data Access
All sequence data from the Belcheri's lancelet genome project have been deposited in GenBank under accession code PRJNA214454. All EST and RNA-seq reads are deposited in the NCBI Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra) under accession numbers SRX137009, SRX137010, SRX137015, SRX344155 and SRX344156. The reference haploid assemblies for B. belcheri are available on our website.

Supplementary data
Supplementary data are available at Database online.