FishTEDB: a collective database of transposable elements identified in the complete genomes of fish

Abstract
Transposable elements (TEs) are important for host gene regulation and genome evolution. Consensus sequences of TEs can assist investigators in accelerating studies on TE origins, amplification, functions and evolution, as well as comparative analyses and prediction of TEs in different species. In evolution, physiology, ecology and heredity research, fish are important models. However, to date, no comprehensive resource for TE consensus sequences exists for fish. Here, we collected genome-wide data and developed a novel database, FishTEDB, including 27 bony fishes, 1 cartilaginous fish, 1 lamprey and 1 lancelet. De novo, structure-based and homology-based approaches were combined to detect TEs. The database is open-source and user-friendly, and users can browse, search and download all data. FishTEDB also provides GetORF, BLAST and HMMER tools to analyze sequences.
Database URL: http://www.fishtedb.org/


Introduction
Transposable elements (TEs) are discrete DNA segments that can insert into new chromosomal locations by one of two mechanisms (1). TEs are typically divided into Class I ('copy and paste' style, retrotransposons) and Class II ('cut and paste' style, transposons) based on whether the intermediate they use to move is RNA or DNA (2). On the basis of sequence similarities and structural relationships, these classes can be further subdivided into orders and superfamilies. Retrotransposons are commonly grouped into five distinct orders: long terminal repeat (LTR), Dictyostelium intermediate repeat sequence (DIRS), Penelope-like element (PLE), long interspersed nuclear element (LINE) and short interspersed nuclear element (SINE). DNA transposons consist of four main orders: terminal inverted repeat (TIR), Helitron, Crypton and Maverick (3). TEs are commonly considered molecular parasites owing to their removable and reproducible characteristics. However, studies of TEs in the past several decades have shown that transposons can affect gene regulation, function and coding ability (4)(5)(6). Transposons also play important roles in new gene creation, chromosome rearrangement and genome evolution (7)(8)(9)(10)(11). Recently, the regulatory activities of TEs in both plants and animals have become a focus of research. For example, in the peppered moth, TEs enhance cortex gene expression levels, which underlies the adaptive coloration that occurred during the industrial revolution (12). In oil palms, sporadic demethylation of a Karma TE within an intron of the MANTLED gene caused the mantled fruit phenotype (13).
Fish are the largest and oldest group of vertebrates. Thus far, 33 700 species have been recorded in FishBase (http://www.fishbase.org/, version 10/2017), and this number is constantly increasing. Fish play a crucial role in modern biology. For example, zebrafish are not only model organisms for developmental biology but also a major disease research model (14,15). Lungfish and coelacanth, which have been described as 'living fossils', provide a unique opportunity to understand the mechanisms that enabled the successful adaptation of vertebrates to land (16,17). The content, diversity and distribution of TEs in fish genomes have been studied (18)(19)(20)(21); however, the functions and evolutionary significance of transposons in fish genomes are largely unknown. A comprehensive database of fish TEs is needed to facilitate studies of TE functions and evolution in fish genomes.
In this study, we identified 33 260 consensus sequences of TEs classified into 50 superfamilies from 28 fish species, 1 lamprey and 1 lancelet, using de novo, structure-based and homology-based approaches. We integrated all data into a centralized database, FishTEDB, which allows users to browse, search and download all data. In addition, web-based GetORF, BLAST and HMMER tools are provided to facilitate analyses of genomic sequences. FishTEDB can be used not only to study the origin, amplification mechanism and evolutionary dynamics of TEs in fish, but also for comparative analyses among vertebrates to elucidate the effects of TEs on genes and genomes.

Collection and identification of TEs in fish genomes
TE libraries of fish were generated using de novo, homology-based and structure-based methods (Figure 1). De novo identification of TEs was performed using RepeatModeler (http://www.repeatmasker.org/RepeatModeler/, version 1.0.7), which automates runs of RECON (24) and RepeatScout (25) on fish genomic databases; its output was used to build, refine and classify consensus models of putative interspersed repeats. Repeats identified by RepeatModeler were screened with Tandem Repeats Finder (http://tandem.bu.edu/trf/trf.unix.help.html, version 4.07b) using the default parameters, and those with tandem repeat coverage of >25% were filtered out. The retained sequences were used as queries for BlastX (identity > 30%, e-value < 1e-5 and percent query coverage > 50%) against Swiss-Prot to filter out protein-coding genes. We constructed a library of ncRNAs using tRNAscan-SE (version 1.3.1) (26) and Rfam (27) to filter out tRNA and rRNA by Blastn (identity > 90%, e-value < 1e-5 and percent query coverage > 90%).
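The three filters described above can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the `Hit` record, the function names and the single-alignment definition of query coverage are assumptions made for the example, and the tandem repeat fractions are assumed to have been computed beforehand from Tandem Repeats Finder output. Only the thresholds are taken from the text.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass
class Hit:
    """One BLAST alignment, reduced to the fields needed here (hypothetical record)."""
    query_id: str
    percent_identity: float  # 0-100
    evalue: float
    query_start: int         # 1-based, inclusive
    query_end: int           # 1-based, inclusive
    query_length: int


def query_coverage(hit: Hit) -> float:
    """Percent of the query covered by this single alignment (a simplification)."""
    aligned = abs(hit.query_end - hit.query_start) + 1
    return 100.0 * aligned / hit.query_length


def flagged(hits: Iterable[Hit], min_identity: float,
            max_evalue: float, min_coverage: float) -> Set[str]:
    """Return the query ids that have at least one hit passing all three thresholds."""
    return {
        h.query_id
        for h in hits
        if h.percent_identity > min_identity
        and h.evalue < max_evalue
        and query_coverage(h) > min_coverage
    }


def filter_consensi(ids: List[str], protein_hits: Iterable[Hit],
                    ncrna_hits: Iterable[Hit],
                    tandem_fraction: Dict[str, float]) -> List[str]:
    """Drop consensi that look like tandem repeats, protein-coding genes or ncRNAs."""
    protein_like = flagged(protein_hits, 30.0, 1e-5, 50.0)  # BlastX vs Swiss-Prot
    ncrna_like = flagged(ncrna_hits, 90.0, 1e-5, 90.0)      # Blastn vs ncRNA library
    return [
        i for i in ids
        if tandem_fraction.get(i, 0.0) <= 0.25              # >25% tandem repeats is removed
        and i not in protein_like
        and i not in ncrna_like
    ]
```

Sequences with no entry in `tandem_fraction` are treated as free of tandem repeats, which is a choice made for the sketch only.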
For the LTR and non-LTR retroelements, given their easier-to-detect structural peculiarities (3), a structure-based approach was used. For LTR retrotransposons, LTR_STRUC (28) and MGEScan-LTR (http://darwin.informatics.indiana.edu/cgi-bin/evolution/daphnia_ltr.pl) were used to search the fish genome assemblies with default parameters. For MGEScan-LTR, intact LTR retroelements were identified using multiple empirical rules: similarity of the pair of LTRs at both ends, presence of internal regions (IRs), di(tri)-nucleotides at flanking ends and target site duplications (TSDs). We retained only the results that had all four features. This framework was applied to identify a large number of novel elements, which were later analyzed to estimate the evolutionary history and relationships of LTR retrotransposons. Non-LTR retrotransposons were identified by the pHMM-based MGEScan-non-LTR (29) program with default parameters.
Given that Class II TEs lack easy-to-detect structural features, a homology-based method using TESeeker was employed to predict them. TESeeker is an automated, BLAST-based homology approach for identifying TEs that also makes use of the CAP3 assembly program, the ClustalW2 multiple sequence alignment tool and numerous BioPerl scripts (30). In total, 257 transposase protein sequences from fish DNA transposons were extracted from RepBase and NCBI. These sequences were used as the library in TESeeker. Finally, we retained only the sequences with the highest quality in the consensus_contigs.fas file.
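To make the four structural rules concrete, the sketch below checks a single candidate element for all of them. It does not reproduce MGEScan-LTR: the function names, the ungapped position-by-position LTR comparison, the 0.8 identity cutoff, the TG...CA terminal motif and the 4 to 6 bp TSD range are all assumptions chosen for illustration.

```python
from typing import Optional


def find_tsd(upstream: str, downstream: str,
             min_len: int = 4, max_len: int = 6) -> Optional[str]:
    """Return the longest k-mer that ends the upstream flank and starts the downstream flank."""
    for k in range(max_len, min_len - 1, -1):
        if len(upstream) >= k and len(downstream) >= k and upstream[-k:] == downstream[:k]:
            return upstream[-k:]
    return None


def ltr_identity(left: str, right: str) -> float:
    """Ungapped identity between the two LTRs (real tools use an alignment instead)."""
    if not left or not right:
        return 0.0
    matches = sum(a == b for a, b in zip(left, right))
    return matches / max(len(left), len(right))


def has_terminal_motif(ltr: str, start: str = "TG", end: str = "CA") -> bool:
    """Check the assumed dinucleotides at the two ends of one LTR."""
    return ltr.startswith(start) and ltr.endswith(end)


def is_intact(left_ltr: str, internal: str, right_ltr: str,
              upstream: str, downstream: str, min_identity: float = 0.8) -> bool:
    """Keep a candidate only if all four structural features are present."""
    return (
        ltr_identity(left_ltr, right_ltr) >= min_identity             # similar LTR pair
        and len(internal) > 0                                         # internal region
        and has_terminal_motif(left_ltr) and has_terminal_motif(right_ltr)
        and find_tsd(upstream, downstream) is not None                # target site duplication
    )
```

Requiring every feature at once trades sensitivity for precision: degraded or truncated copies are discarded, which matches the stated goal of retaining intact elements.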

TE classification and redundancy elimination in fish genomes
When identifying TEs in fish genomes, some software (TESeeker, RepeatModeler, MGEScan-LTR) can classify TEs into superfamilies, but the classification of some sequences remains unknown. REPCLASS (version 1.0, https://github.com/feschottelab/REPCLASS) and TEclass (31) were used to classify these TEs. REPCLASS was the first software we applied. It uses an automated high-throughput workflow model, leveraging various programs to identify and classify TEs in new genomes, and can classify consensus sequences into superfamilies. TEclass uses a support vector machine (SVM) based on oligomer frequencies to classify unknown TEs into DNA transposons, LTRs, LINEs and SINEs (31). Hence, for the consensus sequences that could not be classified into a superfamily by REPCLASS, we used TEclass (http://www.compgen.uni-muenster.de/tools/teclass/generate/index.pl?lang=en) to classify them into orders.
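The fallback order described in this paragraph can be written as a small decision function. The sketch assumes that each tool's result has already been parsed into an optional label; the function name, the argument names and the ("level", "label") return format are hypothetical.

```python
from typing import Optional, Tuple


def resolve_classification(detector_superfamily: Optional[str],
                           repclass_superfamily: Optional[str],
                           teclass_order: Optional[str]) -> Tuple[str, str]:
    """Pick the most specific label available for one consensus sequence.

    Priority: superfamily from the detection software, then superfamily
    from REPCLASS, then order from TEclass, otherwise 'Unknown'.
    """
    if detector_superfamily:
        return ("superfamily", detector_superfamily)
    if repclass_superfamily:
        return ("superfamily", repclass_superfamily)
    if teclass_order:
        return ("order", teclass_order)
    return ("unknown", "Unknown")
```

For example, a sequence that neither the detection software nor REPCLASS could place, but that TEclass assigns to LINEs, resolves to `("order", "LINE")`.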
In the TE prediction step, we combined the results of all methods directly into a 'union' set; therefore, the results contained redundant TEs predicted by different methods. We reduced redundancy using CD-HIT (32) with the parameters cd-hit-est -c 0.90 -n 8. Some transposons may insert into or next to other retrotransposons (especially LTR retrotransposons), forming highly TE-rich regions (nested TEs) (33)(34)(35). For example, some DNA transposons may insert into an LTR retrotransposon. Normally, if all the results are pooled for filtering, DNA transposons are filtered out because they are shorter than LTR retrotransposons. Thus, to prevent interference by nested TEs, we removed redundancies within each superfamily separately, one superfamily at a time. We aligned the sequences that could not be classified at the superfamily level ('Unknown' elements) to the corresponding genomes by BLAST (identity > 85% and coverage > 50%), and retained only sequences with copy number > 3.
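A minimal sketch of the per-superfamily redundancy step is given below. It only groups records and builds one `cd-hit-est` command per superfamily; it does not write FASTA files or run anything. The record layout, the function names and the `<superfamily>.fa` / `<superfamily>.nr.fa` file names are hypothetical. The `-c 0.90` and `-n 8` values come from the text, and `-i` / `-o` are the standard CD-HIT input and output options.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_superfamily(
    records: Iterable[Tuple[str, str, str]]
) -> Dict[str, List[Tuple[str, str]]]:
    """Group (seq_id, superfamily, sequence) records so each superfamily is clustered alone."""
    groups: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for seq_id, superfamily, sequence in records:
        groups[superfamily].append((seq_id, sequence))
    return dict(groups)


def cd_hit_est_command(in_fasta: str, out_fasta: str) -> List[str]:
    """Build the argument list for one clustering run at 90% identity, word size 8."""
    return ["cd-hit-est", "-i", in_fasta, "-o", out_fasta, "-c", "0.90", "-n", "8"]


def plan_clustering(groups: Dict[str, List[Tuple[str, str]]]) -> Dict[str, List[str]]:
    """One command per superfamily, so short nested TEs never compete with long LTR elements."""
    return {sf: cd_hit_est_command(f"{sf}.fa", f"{sf}.nr.fa") for sf in sorted(groups)}


def keep_unknown(copy_numbers: Dict[str, int], min_copies: int = 3) -> List[str]:
    """Retain 'Unknown' consensi whose genomic copy number is strictly greater than min_copies."""
    return sorted(seq_id for seq_id, n in copy_numbers.items() if n > min_copies)
```

Each command list could then be passed to `subprocess.run` once the corresponding FASTA file has been written.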

Implementation and web interface
To make this vast amount of TE data available, a user-friendly web-based database, FishTEDB, was developed (Figure 2). FishTEDB was constructed using Yii 2.0 (a high-performance PHP MVC framework for developing Web 2.0 applications).
We used Linux (CentOS 6.7) as the server operating system, Nginx 1.10 (a high-performance HTTP server and reverse proxy server) as the web server, MySQL 5.7 as the database and PHP 7.0 for web development. Bootstrap 3.3, JavaScript, jQuery and HTML5 were also used for the web pages.

Browser
All TEs are displayed in the browsing interface in species- and superfamily-centric manners. In the superfamily-centric interface, users can browse a superfamily by clicking the corresponding number, and detailed information for each superfamily can be retrieved using the hyperlinks provided (Figure 2A). In the species-centric interface, all TEs are assigned to their corresponding species. TE data are browsed in the same way in both interfaces (Figure 2B). Users can also use a keyword (TE class, TE order, TE superfamily or species name) to locate entries in the search section, which is implemented with approximate string matching (Figure 3A). All data can be downloaded. In addition, we counted the sequences in each superfamily and displayed the counts as a pie chart and a histogram (Figure 4).
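The text does not say which approximate string matching algorithm the search section uses. As one possible illustration, the sketch below uses the similarity ratio from Python's standard `difflib` module; the function name, the 0.6 cutoff and the case-insensitive comparison are assumptions, and the production site is written in PHP rather than Python.

```python
import difflib
from typing import List


def fuzzy_search(keyword: str, entries: List[str],
                 limit: int = 5, cutoff: float = 0.6) -> List[str]:
    """Return up to `limit` entries that approximately match the keyword, best first."""
    # Compare in lower case, but report entries in their original spelling.
    by_lower = {entry.lower(): entry for entry in entries}
    matches = difflib.get_close_matches(
        keyword.lower(), list(by_lower), n=limit, cutoff=cutoff
    )
    return [by_lower[m] for m in matches]
```

With this approach a misspelled query such as "Gypsi" still retrieves the Gypsy superfamily, and "danio reio" still retrieves Danio rerio.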
i. BLAST is used for homology searches. Users can align query sequences of interest against FishTEDB to make a preliminary judgment (whether the query sequence is a TE and which type it belongs to). BLAST also helps researchers detect whether TEs exist in the sequences upstream and downstream of genes of interest.
ii. Users can identify potential open reading frames (ORFs) in query sequences using the GetORF tool. Given that some TEs differ (especially between species) even though they belong to the same superfamily, the results of a BLAST alignment may be insufficient. GetORF can predict amino acid sequences (transposase, integrase, reverse transcriptase), and can be combined with BLAST and HMMER for TE identification and classification in species distantly related to fish at the nucleotide level.
iii. HMMER is used to identify the transposase, endonuclease and reverse transcriptase domains of transposons. All profile hidden Markov model (profile-HMM) databases were collected from a previous study (29) and Pfam (39).
Examples of BLASTN, GetORF and HMMER results are shown in Figure 3B-D, respectively.
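To illustrate the kind of result an ORF search produces, here is a minimal sketch of an ORF scan. It is not the GetORF tool itself: it looks only at the three forward-strand frames, defines an ORF as running from an ATG to the next in-frame stop codon, and uses a hypothetical `min_codons` length cutoff. A full scan would also cover the reverse complement.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 30) -> List[Tuple[int, int, int]]:
    """Return (start, end, frame) for forward-strand ORFs, as 0-based half-open coordinates.

    An ORF starts at the first ATG seen in a frame and ends with the next
    in-frame stop codon (the stop is included in the reported span).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                # Length is counted in codons, excluding the stop codon.
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None
    return orfs
```

The predicted ORFs can then be translated and passed to a protein-level search, which is how an ORF step complements nucleotide BLAST for divergent elements.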

Results and discussion
In the seminal work of Barbara McClintock, TEs were proposed as the 'controlling elements' of maize (40). Since then, many researchers have paid close attention to the functions of TEs; however, to what extent the pervasive colonization of genomes by TEs has affected the evolution of eukaryotic gene regulation remains a matter of speculation and controversy (41). The evolution of fish began 530 million years ago during the Cambrian explosion (42). It was during this time that the early vertebrates developed the skull and the vertebral column, leading to the first vertebrates (43). Thus, supposing a TE mechanism, investigating the roles of TEs in genome evolution and their impact on host genes in fish may offer insights for other vertebrates. In this study, we constructed an effective combined pipeline, suitable not only for fish but also for other vertebrates. FishTEDB provides a good basis for TE functional studies and can support such studies in two ways. First, FishTEDB can enrich the transposon data of vertebrates and promote transposon research. In particular, it provides a homology database for the identification and classification of TEs. Second, researchers can combine the tools in FishTEDB with their own sequences to rapidly locate potential TEs. We identified 33 260 TE consensus sequences from 30 species: 28 fishes, 1 lamprey and 1 lancelet. Most TEs were classified into known superfamilies (Table 2). In addition, the results suggest that TEs are diverse in fish genomes. In particular, the Gypsy, L1, L2, R2, RTE, Rex, Tc1-Mariner and hAT superfamilies showed higher diversity than other superfamilies. Nevertheless, the fishes and the lancelet showed a lower diversity of SINEs.
It should be noted that we classified only 60% of consensus sequences into superfamilies; many TEs still cannot be classified into known superfamilies. The karyotypes and genome sizes of fish are more diverse and complex than those of other vertebrates, and an extra level of complexity arises from the whole genome duplication (WGD) and the rediploidization event that teleost fish have undergone during evolution (44). Therefore, we speculate that there are many fish-specific transposons, such as Zisupton (45). TE research is difficult without a dedicated database. The transposon information for zebrafish in RepBase is probably the most comprehensive thus far, but it is still not sufficient to assist the classification of fish TEs. Nevertheless, these TEs may have potential effects on regulating host gene function and expression. In future studies, we will focus on the identification of novel superfamilies to further enrich TE data resources.
Note. Numbers represent the number of consensus sequences and N indicates undetected.