realDB: a genome and transcriptome resource for the red algae (phylum Rhodophyta)

Abstract With over 6000 species in seven classes, red algae (Rhodophyta) have diverse economic, ecological, experimental and evolutionary values. However, red algae are usually absent or rare in comparative analyses because genomic information of this phylum is often under-represented in various comprehensive genome databases. To improve the accessibility to the ome data and omics tools for red algae, we provided 10 genomes and 27 transcriptomes representing all seven classes of Rhodophyta. Three genomes and 18 transcriptomes were de novo assembled and annotated in this project. User-friendly BLAST suit, Jbrowse tools and search system were developed for online analyses. Detailed introductions to red algae taxonomy and the sequencing status are also provided. In conclusion, realDB (realDB.algaegenome.org) provides a platform covering the most genome and transcriptome data for red algae and a suite of tools for online analyses, and will attract both red algal biologists and those working on plant ecology, evolution and development. Database URL: http://realdb.algaegenome.org/


Introduction
Red algae (phylum Rhodophyta) have various values in our daily life. They are important sources of food, such as nori used in sushi and pudding made of Irish moss. The high content of vitamins and proteins of red algaederived foods has made them attractive and popular in east Asia for >1000 years (1). Red algae have valuable ecological roles, such as producing oxygen in the seawater while some species are important in the formation of tropical reefs. In many Pacific atolls, red algae have contributed far more to reef structure than other organisms including corals (2). In the oceans, various species of red algae are primary producers eaten by fish, crustaceans, worms and gastropods.
Red algae occupy the second basal branch in the green lineage following the Glaucophyta algae (3). Some red algal species have important evolutionary value for studying basic biological questions such as the origin of multicellularity (4), symbiosis (5) and evolution of photosynthesis. There are about 6000 species of red algae [Source: AlgaeBase (6), www.algaebase.org], ranging from singlecelled species to complex, multi-cellular, 'plant-like' organisms. They are also excellent material to study symbiosis, since many are inexorably associated with other organisms. Some species are used to produce agars, which are gelatinous food additives and in science labs as a support substance in culture media (7).
The current available red algae related data, such as those included in AlgaeBase (www.algaebase.org) and Porphyra website (http://www.porphyra.org/), are limited to morphological descriptions. The integration of genome data and morphological data is in its beginning stage. For instance, the comprehensive database phytozome V12 (phytozome.jgi.doe. gov/pz/portal.html), plant genome duplication database (PGDD, chibba.agtec.uga.edu/duplication), plant genome database (PlantGDB, plantgdb.org) (release V187) and plant genome and systems biology (PGSB, pgsb.helmholtz-muen chen.de/plant) database do not include any red algae genome. The pico-Plaza 2.0 (bioinformatics.psb.ugent.be/plaza/ver sions/pico-plaza/) database include one red algal genome, while CoGe database has two and Ensembl Plant database (plants.ensembl.org) has three ( Figure 1). This dearth of information leads to the underestimation of the biological importance of red algae. Comparative analysis of red algae species, such as the evolutionary studies of genes families (8,9), noncoding genes and small RNAs (10) lag far behind in plant science, partly because of the difficulty in obtaining red algae genomic information. In breeding, an open platform integrating various omics data and species information is the demand for scientists and breeders (11).
To meet the ever-rapid increasing amounts of genomic and transcriptome data and their tremendous potential in understanding developmental process (12) and to assist molecular breeding, we build an online, searchable platform for integrating the ome data with the use of multiple omics tools. The Information gained will be a valuable to boost the understanding red algae genomes and the evolution of plant genomes.

Data description
Dataset Sequences from six genomes (including partial genomes) and seven transcriptomes, together with annotation data were downloaded directly from public available websites. These data were shared freely on these websites, which provided no or little online analysis tools. Raw reads of four genomes and 20 transcriptomes were downloaded from NCBI-SRA (www.ncbi.nlm.nih.gov/sra) database without annotations, thus were de novo assembled and annotated in this study (Supplementary Table S1). The red algal transcription factors were predicted in this study, relying on the HMMsearch tool from the HMMER software (hmmer.org) with default parameters and homology seeds from Pfam database (pfam.xfam.org).

Assembly and annotation of genomes and transcriptomes
All the original reads from the downloaded raw data were filtered using Trimmomatic (13) (https://github.com/tim flutre/trimmomatic) to remove the adapters and low-quality reads. These clean reads were then de novo assembled using the software Trinity (14) (https://github.com/trinityrnaseq/ trinityrnaseq). Trinity produced the transcriptome files in FASTA format and the assembled sequences were then used for gene identification. TransDecoder was integrated in Trinity software and was employed for detecting gene regions (https://github.com/TransDecoder/TransDecoder/). Kyoto Encyclopedia of Genes Genomes (KEGG) and Enzyme Commission data were both obtained by BLAST genes with the KEGG database (https://www.kegg.jp/kegg/).

Database construction
The realDB database employs Aliyun, one of the largest cloud server providers in the world, thus facilitates realDB outstanding advantages such as (i) scalability in easily expanding its storage size and computing ability, (ii) more stability and (iii) simple to maintain. The realDB relies on the Linux Ubuntu Server 14.04.4, Apache2.4.18, Java (version 1.8) and Java Server Page (JSP) 2.0. realDB provides an efficient and friendly interface for users to access a multitude of red algae data, which displays a simple and direct homepage. The searching system was created using PHP 7.0.22 and MySQL 5.7.20 software.

Results and discussions
An updating timeline for the sequenced red algae To attract more visits to our online platform, we created an updating timeline system on the homepage of realDB that updates the recently sequenced genome or transcriptome of red algal species ( Figure 2). This timeline system consists of >1000 lines of code adapted from vis.js (http:// visjs.org/), dedicated to providing multiple forms of information, including the release time, genome size, reference and authors. User can click the hyperlink to browse the reference or related linked websites for additional information. The defining feature of this timeline tool is its dynamics and interactive features with species information. Users can move the timeline space and zoom in or zoom out of the timeline by dragging and scrolling in the species timeline zone. The time-scale on the axis is adjusted automatically (http://almende.github.io/chap-links-library/ graph.html), supporting scales ranging from milliseconds to years. We will create new items when genome or transcriptome from other red algal species become available.

Bootstrap boosted framework for various display facilities
Users are able to check our genome updates and news via mobile phone using the Bootstrap framework, which is the world's leading framework for building responsive, mobile-first sites (15). Users are able to check updates of genome releases or website news of realDB via mobile, ipad, laptop and desktop using all popular web browsers including Google Chrome, Safari, Firefox, Internet Explorer, etc. without any display difficulty ( Figure 2).

Various introductions to red algae for wide readership
The lack of red algal genome sequences in various databases is partly due to the limited knowledge of red algae. The molecular biological studies of red algae provide many useful results, such as information on systematics, physiology, ecology and evolution. Concise introductions to each species assist visitors with different backgrounds to quickly decide which species to analyze. Most of the information for each species was provided and cited from the book 'Red Algae in the Genomic Age' (16), including descriptions of life histories, forms and styles, genomic information and data sources. We also provided the description and classification of red algae on the website because general researchers and comparative genomic biologists usually do not have extensive knowledge of red algae classification or morphology. realDB covers the largest number of red algae with ome data The current realDB V1.0 gathered 10 available genomes and 27 transcriptomes, representing all the 7 classes in the Rhodophyta. Among this dataset, we de novo assembled the genomes of Galdieria phlegrea, Gracilariopsis lemaneiformis and Porphyridium cruentum, and 18 transcriptomes (Table 1). realDB has provided 37 ome datasets, including genome and transcriptome sequences ( Figure 2). In comparison, the green lineage oriented genome database Phytozome (phytozome.jgi.doe.gov) does not contain any genome/transcriptome data of red algae. Furthermore, the algae-oriented database Pico-PLAZA (bioinformatics.psb.ugent.be/plaza/versions/pico-plaza) harbors only one red algal genome, and the plant-specific database Ensemble Plant (plants.ensembl.org) has included only three red algal genomes (Figure 2). In realDB, we selected Chondrus crispus, Cyanidioschyzon merolae, Galdieria sulphuraria, G. phlegrea as flagship red algae with the best genome sequencing and assembly.

A suit of toolbox for online analysis
Besides the downloadable dataset, online tools would facilitate data retrieval and comparative analyses. Currently, realDB provides a complete suite of BLAST tools (Figure 3) consisting of BLASTn, BLASTx, BLASTp, tBLASTn and tBLASTx. This BLAST suit was constructed using the sequenceserver tool (www.sequen ceserver.com/). A list of 21 advanced parameters such as -evalue 1.0e-10 -max_target_seqs 10 are optional for searches. For the nucleotides, coding sequences (CDS) and genomes were separated and could be individually selected. Users will find the GenBank style formatted BLAST results easy to use and download hits in FASTA format, and align data in tab-delimited or XML formats.
JBrowse was incorporated into realDB (Figure 3), allowing users to instantly browse, visualize, and retrieve sequence data. Currently, we provided the Jbrowse tool for C. crispus, C. merolae, G. sulphuraria, which are well assembled and annotated genomes. Using Jbrowse tool, users can easily browse and analyze these genomes at various scales with a graphic interface. Detailed gene information could be conveniently viewed and fetched by zooming in and out the interested genomic region, to view the information such as location, annotation and sequences by clicking on the corresponding tracks. The search tools in realDB provide a series of search service for CDS, protein, gene annotation, gene family, transcription factors and miRNA information (Figure 4). These information will be useful for both wet lab and dry lab biologists. Gene families, especially transcription factor families, control various physiological processes and are breeding targets (17)(18)(19)(20)(21)(22)(23)(24)(25)(26)(27). miRNAs have been extensively studied in land plants and green algae (12,28,29). However, little is known about its function and evolutionary trajectory in red algae. We incorporated four miRNA datasets that have been experimentally validated from Porphyridium purpureum (30) (Porphyridiophyaceae), C. crispus (31) (Florideophyaceae), Eucheuma denticulatum (32) (Florideophyaceae), P. yezoensis (33) (Bangiophyaceae) into realDB. Users can easily discover a miRNA and related information through our search system.

Conclusion and future perspectives
Red algae (Rhodophyta) have a critical place in plant evolution as the second branch after Glaucophyta, attracting thousands of scientists in areas of ecology, evolution and genomics. They are also attractive to people working on bioengineering, medicine and food science. These study of red algae is facing the rapid development of genomics. Facilitated by low-cost and fast sequencing technologies, more and more red algae have their genomes and transcriptomes sequenced. realDB is dedicated to being the leading platform for analyzing red algae genomes by providing the latest omics data and oneline analysis tools. Currently, we provide the most genome and transcriptome data for 37 red algae that are freely available to all researchers. The realDB Version 1.0 database is the first release and will be updated when new datasets are available. Furthermore, we will incorporate additional bioinformatics tools for easier data access and online analyses. Since its release in September 2017, realDB has attracted the attention of scientists from around the world, and the website has been visited by researchers from 27 countries (April 2018). All people interested in realDB are encouraged to contact us for data sharing and collaboration. We are dedicated to collaborating with international teams to collect more data and develop more tools, hoping to make realDB the most influential database for red algae studies.

Supplementary data
Supplementary data are available at Database Online.
Funding F.C. is supported by a grant from natural science foundation of Fujian Province (2018J01603) and a grant from State Key Laboratory of Ecological Pest Control for Fujian and Taiwan Crops (SKB2017004). L.Z. is supported by the National Natural Science Foundation of China