AtFusionDB: a database of fusion transcripts in Arabidopsis thaliana

Abstract
Fusion transcripts are chimeric RNAs generated by fusion at either the DNA or the RNA level. These novel transcripts have been studied extensively in human cancers but remain underexamined in plants. In this study, we introduce the first plant-specific database of fusion transcripts, named AtFusionDB (http://www.nipgr.res.in/AtFusionDB). This comprehensive database contains detailed information about fusion transcripts identified in the model plant Arabidopsis thaliana. A total of 82 969 fusion transcript entries generated from 17 181 different genes of A. thaliana are available in the database. Each entry carries basic information consisting of the Ensembl gene names, official gene name, tissue type, EricScore, fusion type, AtFusionDB ID and sample ID (e.g. Sequence Read Archive ID). Additional information, such as the UniProt ID, gene coordinates (together with the function of the parental genes), junction sequence and expression levels of both parent genes and the fusion transcript, may be of high utility to the user. Two search modules, 'Simple Search' and 'Advanced Search', are provided in addition to a 'Browse' option with a data download facility. Three modules for mapping and alignment of query sequences, namely BLASTN, SW Align and Mapping, are incorporated in AtFusionDB. This database is a head start for exploring the complex and unexplored domain of gene/transcript fusion in plants.


Introduction
The origin and evolution of new genes are constant sources of evolutionary renovation and adaptation. Gene duplication, de novo gene origination, transposition, fission and fusion are the major processes leading to the genesis of new genes (1,2). A fusion transcript is a hybrid RNA composed of transcripts from two separate genes (3). It can arise by translocation of the original genes at the DNA level or post-transcriptionally during splicing, and it has been documented in diverse life forms (4). More specifically, fusion transcripts are formed either by gene or chromosomal rearrangements (gene fusion at the DNA level by translocation, deletion and inversion) or by intergenic RNA cis-splicing and trans-splicing events, i.e. transcript fusion at the RNA level (5-7). The fusion transcripts formed post-transcriptionally are more common (8). The splicing event is said to be of 'cis-type' when the two exons derive from two neighboring genes that are transcribed together, or of 'trans-type' when the two exons originate from two separate precursor mRNAs. Both types of splicing events are mediated by the spliceosome complex (9). Read-through fusion transcripts are generated by fusion of two adjoining, individually spliced genes in the same orientation and on the same strand, resembling alternative splicing (10,11). Likewise, fusion transcripts that originate from the hybridization of transcripts from two nearby genes located on opposite strands are termed cis-acting chimeric transcripts (12). Intra-chromosomal fusion transcripts are generated by fusion of genes or transcripts from the same chromosome, whereas inter-chromosomal chimeric transcripts are formed by gene or transcript fusion between different chromosomes (13). The different fusion transcript types, viz. read-through, cis-acting, intra-chromosomal and inter-chromosomal transcripts, are illustrated in Figure 1.
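The four categories defined above can be expressed as a small decision rule. The following Python sketch is only an illustration of those definitions: the `GeneLocus` type, the function names and, in particular, the distance cut-off `NEIGHBOR_MAX_GAP` used to decide whether two genes count as "neighboring" are assumptions made here, since the text does not state a numeric threshold.

```python
from dataclasses import dataclass

# Hypothetical cut-off (in bp) for calling two genes "neighboring".
# The text gives no number; this value is an assumption for illustration.
NEIGHBOR_MAX_GAP = 10_000


@dataclass(frozen=True)
class GeneLocus:
    chromosome: str
    start: int
    end: int
    strand: str  # "+" or "-"


def gap_between(a: GeneLocus, b: GeneLocus) -> int:
    """Distance in bp between two loci on one chromosome (0 if they overlap)."""
    return max(0, max(a.start, b.start) - min(a.end, b.end))


def classify_fusion(a: GeneLocus, b: GeneLocus, max_gap: int = NEIGHBOR_MAX_GAP) -> str:
    """Assign one of the four fusion types described in the text."""
    if a.chromosome != b.chromosome:
        return "inter-chromosomal"
    if gap_between(a, b) <= max_gap:
        # Neighboring genes: same strand -> read-through, opposite strands -> cis-acting.
        return "read-through" if a.strand == b.strand else "cis-acting"
    return "intra-chromosomal"


# Toy loci (not real A. thaliana coordinates).
g1 = GeneLocus("1", 1_000, 2_000, "+")
g2 = GeneLocus("1", 2_500, 3_500, "+")
g3 = GeneLocus("1", 2_500, 3_500, "-")
g4 = GeneLocus("1", 900_000, 901_000, "+")
g5 = GeneLocus("5", 2_500, 3_500, "+")

for partner in (g2, g3, g4, g5):
    print(classify_fusion(g1, partner))
```

A prediction tool may apply different or additional criteria; the sketch only mirrors the verbal definitions given in this section.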
The existence and impact of fusion transcripts at the molecular and physiological levels have been studied in eukaryotes including Drosophila (14), zebrafish (15), plants (16-18) and humans (19,20). The role of gene fusion in promoting hematological and solid cancers is well established in humans (21,22), and this has paved the way for inquiry into their biological relevance in other organisms. The extensively studied BCR-ABL1 fusion transcript promotes malignancy in chronic myelogenous leukemia (23). Such fusion transcripts have been exploited as biomarkers in cancer prediction and as targets of molecular therapeutics (24). Chimeric transcripts may act as long noncoding RNAs or encode novel chimeric proteins (25), and can thus alter cellular signaling and overall functioning in diverse organisms. The emergence of high-throughput technologies has led to the accumulation of enormous amounts of sequencing data, which has eased the understanding of the molecular mechanism behind this complex event, and attempts are being made to elucidate its implications in eukaryotic organisms including plants (21). The currently available fusion transcript databases, such as ChiTaRS (26), FusionCancer (24), ChimerDB (27), Mitelman Database (28) and FusionHub (29), harbor information on fusion transcripts reported in human cancers, mouse and flies. To date, only scarce knowledge about fusion transcripts is available for plants (30-32). To the best of our knowledge, no freely available fusion transcript database exists for plants. In this study, we have developed a database named AtFusionDB, a plant-exclusive knowledge base for fusion transcripts predicted in the model plant Arabidopsis thaliana (the thale cress or mouse-ear cress). The overall structure and major elements of AtFusionDB are highlighted in Figure 2.
Gene fusion is believed to be a major factor for controlling morphology, physiology and phenotypic character in an organism as well as a major contributor for adaptive evolution. Thus, this attempt will unravel new directions for exploring the impact and consequences of gene-fusion events in the plant kingdom and elucidating the significance of shuffling and fusion of transcripts on the physiology of plants.

Identification of fusion transcripts
The FASTQ files of the paired-end RNA-Seq runs obtained from the previous step were given as input to 'EricScript-Plants' (https://github.com/asherkhb/EricScript-Plants) for the identification of fusion transcripts in A. thaliana. EricScript-Plants is a modified version of EricScript (33) that works for all the plant species available at Ensembl Plants (34). This version of EricScript was downloaded using the command 'git clone' (https://github.com/asherkhb/EricScript-Plants.git). EricScript is a freely available software package (https://sites.google.com/site/bioericscript) for the identification of fusion transcripts from paired-end RNA-Seq data sets. It is developed in Practical Extraction and Reporting Language (PERL) and requires several dependencies, i.e. R (http://cran.r-project.org/), the ada package (http://cran.r-project.org/web/packages/ada/index.html), BWA (35), SAMtools version >0.1.17 (36), bedtools version >2.15 (37), BLAT (38) and seqtk (https://github.com/lh3/seqtk). The major limitation of this script is that it maps sequencing reads to a transcriptome instead of a reference genome. EricScript-Plants reports the candidate fusions in two tab-delimited files: the first file (e.g. samplename.results.total.tsv) contains all the identified fusions, whereas the other file (e.g. samplename.results.filtered.tsv) reports the fusions with 'EricScore' >0.5. For quality filtering, EricScript exploits three different scores, viz. genuine junction score, edge score and uniformity score. Using an AdaBoost classifier (39), these three scores are unified into a single score, called 'EricScore', which assigns each candidate fusion a probability score of the 'well' pattern and thereby classifies the fusions, discriminating between real transcripts and false-positive events (33). The value of 'EricScore' expresses the probability that a fusion transcript is real, with the score ranging from 0.01 to 0.99. The transcripts with the highest EricScore are the most likely to be true fusion transcripts. To eliminate false positives, we considered only the fusion transcripts whose 'EricScore' values were >0.5 (i.e. the results in the file samplename.results.filtered.tsv). A detailed explanation of the EricScript output files is given on the web page (https://sites.google.com/site/bioericscript/).
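The score-based filtering described above can be reproduced on a results table with a few lines of Python. This is a minimal sketch, not part of EricScript itself: the column names `GeneName1`, `GeneName2`, `fusiontype` and `EricScore` are assumed to match the header of the tab-delimited output, and the sample rows are invented placeholders rather than real predictions.

```python
import csv
import io

ERICSCORE_CUTOFF = 0.5  # threshold stated in the text


def filter_by_ericscore(tsv_text, cutoff=ERICSCORE_CUTOFF, score_column="EricScore"):
    """Return the rows of a tab-delimited table whose score is strictly above the cutoff.

    `score_column` is an assumed header name; adjust it to the actual file.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if float(row[score_column]) > cutoff]


# Invented example content standing in for a samplename.results.total.tsv file.
SAMPLE_TOTAL_TSV = (
    "GeneName1\tGeneName2\tfusiontype\tEricScore\n"
    "GENE_A\tGENE_B\tRead-Through\t0.91\n"
    "GENE_C\tGENE_D\tinter-chromosomal\t0.32\n"
    "GENE_E\tGENE_F\tCis\t0.50\n"
)

kept = filter_by_ericscore(SAMPLE_TOTAL_TSV)
for row in kept:
    print(row["GeneName1"], row["GeneName2"], row["EricScore"])
```

Because the comparison is strict, a candidate scoring exactly 0.50 is dropped, which matches the '>0.5' criterion used for the database.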
The Bioconductor (https://www.bioconductor.org/) packages, SRAdb (40) and GEOmetadb (41) were utilized for the retrieval of tissue-type information of RNA-Seq samples analyzed in this study and combined with each entry of AtFusionDB.

AtFusionDB web interface development
After the compilation of all the information, the AtFusionDB web interface was developed using Hypertext Markup Language, Cascading Style Sheets, Structured Query Language, JavaScript, PERL and Hypertext Preprocessor on an Apache Hypertext Transfer Protocol server. The coordinates of each individual gene were composed from the chromosome number, breakpoint and strand, separated by '|' in the form chromosome no. | breakpoint | strand (e.g. 5 | 25563 | +).
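The coordinate format above is simple enough to generate and parse programmatically. The sketch below shows one way to do so; the function names are hypothetical and are not taken from the AtFusionDB code base.

```python
def format_coordinate(chromosome, break_pos, strand):
    """Build a 'chromosome no. | breakpoint | strand' string, e.g. '5 | 25563 | +'."""
    if strand not in ("+", "-"):
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")
    return f"{chromosome} | {break_pos} | {strand}"


def parse_coordinate(text):
    """Split a coordinate string back into (chromosome, breakpoint, strand)."""
    chromosome, break_pos, strand = (part.strip() for part in text.split("|"))
    return chromosome, int(break_pos), strand


coordinate = format_coordinate("5", 25563, "+")
print(coordinate)
print(parse_coordinate(coordinate))
```

The chromosome is kept as a string so that non-numeric labels, such as organellar chromosomes, can be handled in the same way.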

Database organization
The entire data stored in AtFusionDB are organized at different levels. At the most basic or primary level, the user can search by using simple keywords such as 'gene name', 'chromosome', 'tissue', 'fusion-name', 'SRA-ID', 'AtFusionDB-ID' etc. as per the requirement. The information will be displayed in tabular form according to the number of display fields selected by the user. The secondary data can be accessed to gain further information on sequencing experiments and detailed information on fusion transcripts by clicking on the hyperlinks of 'SRA-ID' and 'fusion' on search result pages, respectively. At the tertiary level, additional information about the contributing genes giving rise to chimeric transcripts can be accessed by clicking on the hyperlinks of the UniProt ID(s) and EnsemblPlants ID(s) on the fusion information page. We have made efforts to make the database easy and convenient to access and fetch the information supplemented with downloadable links.

AtFusionDB web interface features
AtFusionDB provides two user-friendly 'Search' options, 'simple' and 'advanced', for retrieving fusion transcript information with different types of keywords. The 'Simple Search' option lets the user fetch fusion transcript information by search terms such as gene name, chromosome number and tissue. For flexibility, two matching modes, 'containing' and 'exact', are available for the search terms. This option also lets the user select the fields to be displayed. The 'Advanced Search' option allows a user-built query using up to 11 different combinations of keywords. The keywords (e.g. fusion, tissue, chromosomes, genes, 'EricScore' etc.) can be included together, searched alternatively or excluded using the 'add' and 'remove' facility. The conditional operators '=', 'Like' and '!=' and the two logical operators 'OR' and 'AND' can be used as needed. For convenient browsing, a 'Browse' section is also available. This section enables the user to browse the database by the following categories: fusion type, tissue type, 'EricScore' range, chromosome and frequency. The frequency of occurrence of fusion transcripts can be further browsed with respect to tissue type, condition and fusion type separately.
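The way conditional and logical operators combine in an 'Advanced Search' query can be sketched as a parameterized SQL WHERE clause. The example below is a hypothetical illustration and not the actual AtFusionDB implementation: the table name `fusions`, its columns, the whitelist of fields and the placeholder rows are all assumptions, and SQLite is used only so that the sketch is self-contained.

```python
import sqlite3

# Assumed searchable fields and the operators named in the text.
ALLOWED_FIELDS = {"fusion", "tissue", "chromosome", "gene", "ericscore"}
ALLOWED_OPERATORS = {"=": "=", "!=": "!=", "Like": "LIKE"}
ALLOWED_CONNECTORS = {"AND", "OR"}


def build_where(conditions):
    """Build (where_clause, params) from (connector, field, operator, value) tuples.

    The connector of the first condition is ignored. Fields and operators are
    checked against whitelists; values are passed as bound parameters.
    """
    clauses, params = [], []
    for index, (connector, field, operator, value) in enumerate(conditions):
        if field not in ALLOWED_FIELDS or operator not in ALLOWED_OPERATORS:
            raise ValueError(f"unsupported condition: {field!r} {operator!r}")
        if index > 0:
            if connector not in ALLOWED_CONNECTORS:
                raise ValueError(f"unsupported connector: {connector!r}")
            clauses.append(connector)
        clauses.append(f"{field} {ALLOWED_OPERATORS[operator]} ?")
        # 'Like' is treated here as a substring match.
        params.append(f"%{value}%" if operator == "Like" else value)
    return " ".join(clauses), params


# Toy table with placeholder rows (not real database content).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fusions (fusion TEXT, tissue TEXT, chromosome TEXT, gene TEXT, ericscore REAL)"
)
conn.executemany(
    "INSERT INTO fusions VALUES (?, ?, ?, ?, ?)",
    [
        ("GENE_A GENE_B", "root", "1", "GENE_A", 0.91),
        ("GENE_C GENE_D", "seedling", "5", "GENE_C", 0.77),
        ("GENE_E GENE_F", "root", "C", "GENE_E", 0.64),
    ],
)

where, params = build_where(
    [(None, "tissue", "=", "root"), ("AND", "chromosome", "!=", "C")]
)
rows = conn.execute(f"SELECT fusion FROM fusions WHERE {where}", params).fetchall()
print(where, params, rows)
```

Note that in SQL 'AND' binds more tightly than 'OR', so a query mixing both operators without grouping may not be evaluated strictly left to right.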
The 'Tools' section of AtFusionDB lets the user extract information about fusion transcripts by submitting a query to the following tools: 'SW Align', 'Mapping' and 'BLAST' (42). 'SW Align' aligns the user's query sequence with the fusion transcript junction sequences available in AtFusionDB. This option helps the user to identify and characterize a sequence of interest. Here, we have incorporated the 'WATER' utility of the EMBOSS-6.6.0 package, which follows the Smith-Waterman algorithm (43). The 'Mapping' option maps all the fusion junction sequences from AtFusionDB to the gene sequences provided by the user as query, and only those AtFusionDB sequences that match the query sequences 100% are displayed. This module is useful for the detection of fusion transcripts in newly assembled genome drafts or novel gene sequences. In this module, we have incorporated the BLASTN (44) option of the BLAST software package. The 'BLAST' module finds regions of similarity between the user's input FASTA sequences and the AtFusionDB sequences using BLASTN, with the option to change the Expect value (E value). The respective ID(s) of AtFusionDB sequences producing significant alignments with the query sequences are hyperlinked to display their detailed information. The 'Method' section explains the pipeline adopted for the identification of fusion transcripts. The 'Statistics' page graphically represents the total and unique fusion transcripts incorporated in AtFusionDB on the basis of fusion and tissue types. The user can also view a pie chart depicting the distribution of unique and common fusion transcripts found in SRA samples. The 'Help/Guide' section helps the user to understand AtFusionDB and use it effectively.

Table 1 (partial; fusion gene pair followed by its frequency in parentheses): ... (16); AT4G35450 AT4G35460 (7); AT4G25290 AT4G25280 (7); ATCG00810 ATCG00800 (6); ATCG00190 ATCG00180 (6); AT1G07590 AT5G10100 (88); ATCG00480 ATCG00470 (59); ATCG00810 ATCG00800 (52); AT3G02080 AT5G10100 (46); AT3G59330 AT3G59320 (12); AT4G14960 AT1G50010 (11); AT3G61780 AT5G28400 (10); ATCG00580 ATCG00570 (9)
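To make the local alignment behind 'SW Align' concrete, the following Python sketch implements a basic Smith-Waterman alignment. It is an illustration of the algorithm only and is not the EMBOSS implementation: it assumes a simple match/mismatch score and a linear gap penalty, with parameter values chosen arbitrarily here, whereas the EMBOSS utility works with a substitution matrix and separate gap-opening and gap-extension penalties.

```python
def smith_waterman(query, target, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_query, aligned_target) for the best local alignment.

    Scoring values are illustrative assumptions, not EMBOSS defaults.
    """
    rows, cols = len(query) + 1, len(target) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; local alignment never lets a cell drop below zero.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if query[i - 1] == target[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    aligned_q, aligned_t = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if query[i - 1] == target[j - 1] else mismatch)
        if score[i][j] == diag:
            aligned_q.append(query[i - 1])
            aligned_t.append(target[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            aligned_q.append(query[i - 1])
            aligned_t.append("-")
            i -= 1
        else:
            aligned_q.append("-")
            aligned_t.append(target[j - 1])
            j -= 1
    return best, "".join(reversed(aligned_q)), "".join(reversed(aligned_t))


# Toy sequences: the query occurs exactly inside the target.
result = smith_waterman("ACGT", "TTACGTTT")
print(result)
```

With the default scores, a query that occurs exactly inside a target scores `match` times the query length, so the toy call above gives a score of 8 with both aligned strings equal to `ACGT`.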

Results and discussion
We downloaded and analyzed a total of 4697 paired-end RNA-Seq data sets of A. thaliana. We could not find fusion transcripts in 1036 samples for different reasons (e.g. the quality of data, the quantity of data, the absence of fusion transcripts etc.). Of the remaining 3661 samples, 141 had only fusion transcripts with a low EricScore (i.e. <0.5) and were not considered for further analyses. Finally, the most probable fusion transcripts (with EricScore >0.5) from 3520 samples were incorporated in AtFusionDB. These 82 969 fusion transcripts with 'EricScore' >0.5 were considered for further data processing and refinement in AtFusionDB. Altogether, 17 181 genes were involved in fusion transcript generation. The rank-wise distribution of the total predicted fusion transcripts by 'EricScore' range is listed in the table available in the 'Statistics' section of the database. The 82 969 fusion transcript entries of AtFusionDB are represented by 71 920 unique fusion transcripts. A total of 41 838 fusion transcripts were nonrecurrent and found in only one RNA-Seq sample. However, numerous transcripts were common to two or more samples, as graphically represented in the 'Statistics' section of AtFusionDB. The fusion transcripts (total and unique) were also categorized according to the aforementioned fusion types. Inter-chromosomal transcripts were the most abundant and cis-acting transcripts the least. A graphical representation of their distribution is provided in the 'Statistics' section of AtFusionDB. The three most frequently occurring (i.e. recurrent) fusion transcripts across all analyzed samples were ATPB ATPE, RPL22 RPS3 and PSBE PSBF. All of these transcripts were of the read-through type and originated from the chloroplast. The multifold expression of these specific fusion transcripts, in contrast to the significantly lower expression of their individual contributing genes, indicates that they might have a distinct role in governing cellular dynamics.
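The sample accounting reported above can be checked with simple arithmetic. The short Python sketch below uses only the counts stated in the text; the variable names are chosen here for illustration, and the mean number of entries per retained sample is a derived figure that the text itself does not report.

```python
# Counts taken from the text.
TOTAL_DATASETS = 4697        # paired-end RNA-Seq data sets analyzed
NO_FUSIONS_FOUND = 1036      # samples without any fusion transcript
LOW_SCORE_ONLY = 141         # samples with only EricScore < 0.5 fusions
TOTAL_ENTRIES = 82969        # fusion transcript entries in AtFusionDB

samples_with_fusions = TOTAL_DATASETS - NO_FUSIONS_FOUND
samples_retained = samples_with_fusions - LOW_SCORE_ONLY

# Derived value (not stated in the text): mean entries per retained sample.
mean_entries_per_sample = TOTAL_ENTRIES / samples_retained

print(samples_with_fusions, samples_retained, round(mean_entries_per_sample, 1))
```

The two subtractions reproduce the 3661 and 3520 samples given in the text, and the retained samples carry roughly 23.6 database entries each on average.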
A tissue-wise study of the total fusion transcript entries, as well as of the unique chimeric transcripts, was also carried out for different tissues and developmental stages in the life cycle of A. thaliana. All 3520 samples with reliably predicted fusion transcripts were distributed according to their developmental stage and tissue of origin. Most chimeric transcripts were derived from seedling, seed, root and whole-plant RNA samples. A few genes commonly contributed to chimeric transcript generation in the majority of tissue types: elongation factor 1-alpha 3/4 (A1 A1), chloroplastic ATP synthase subunit beta and ATP synthase epsilon chain (ATPB ATPE), Chlorophyll a-b binding protein 3 (LHCB1.1 LHCB1.3), and Tubulin alpha chain and Tubulin alpha-2 chain (TUBA6 TUBA4). The most frequently occurring fusion transcripts by fusion type and tissue origin, along with their respective frequencies, are shown in Table 1. The tissue-wise distribution of SRA samples is graphically represented in Figure 3. The samples were also categorized by experimental condition (abiotic and biotic stresses), together with their respective frequencies, as shown in Table 2.
By comparison with the fusion transcripts reported in rice by Zhang et al. in 2010 (16), we found 31 fusion-contributing genes from AtFusionDB that are homologous to fusion-contributing genes in rice (Supplementary Data 1). Further, we found two fusion genes from rice, viz. AK101547 AK121590 and AK121590 AK101547, that were homologous to 18 fusion genes in our database (Supplementary Data Table 1.2). A similar comparative study was performed in Nicotiana tabacum (tobacco), and 35 different genes from tobacco were found to be homologous to 10 fusion-contributing genes from AtFusionDB. Genes encoding ribosomal proteins, tubulin and glyceraldehyde-3-phosphate dehydrogenase were common to all three plants considered in the study, suggesting that they play vital roles as fusion transcripts in governing the physiology of plants. The BLAST results and the gene lists supporting the homology studies, along with their respective functions, are provided in the supplementary files (Supplementary Data Tables 1.1, 1.3, 1.4 and 1.5). Thus, our study supports the previous reports of the existence of chimeric transcripts in rice and indicates the significance of these novel fusion transcripts in other plants as well.
Despite continuous efforts by researchers to understand the origin of life and the evolution of genes, genomes and organisms, innumerable questions related to the birth and evolution of genes, which are the structural and functional units of life, remain unanswered. However, the rise of the Big Data era has provided an abundance of sequencing data that can be exploited for a better understanding of the diverse mechanisms by which new genes originate and of the impact of gene fusion on every aspect of growth, development, physiology and adaptive evolution of the parental organisms as well as their progeny. Although a plethora of gene fusion transcripts has been predicted, in-depth study, validation and functional characterization of the transcripts together with their encoded products are yet to be accomplished. AtFusionDB is the first attempt to gather and store information related to fusion transcripts in plants. This database will make it easier to explore the significance of gene/transcript fusion in plants and should help biologists gain knowledge of this rarely explored domain in the plant kingdom.

Authors' contributions
D.D. and A.S. developed the web interface of the database. D.D., A.S., S.Z. and S.K. collected and compiled the data and performed the analysis. S.Z. and S.K. wrote the manuscript. S.K. conceived the idea and coordinated the project.

Supplementary data
Supplementary data are available at Database Online.