gEVE: a genome-based endogenous viral element database provides comprehensive viral protein-coding sequences in mammalian genomes

In mammals, approximately 10% of genome sequences correspond to endogenous viral elements (EVEs), which are derived from ancient viral infections of germ cells. Although most EVEs have been inactivated, some open reading frames (ORFs) of EVEs obtained functions in the hosts. However, EVE ORFs usually remain unannotated in the genomes, and no databases are available for EVE ORFs. To investigate the function and evolution of EVEs in mammalian genomes, we developed EVE ORF databases for 20 genomes of 19 mammalian species. A total of 736,771 non-overlapping EVE ORFs were identified and archived in a database named gEVE (http://geve.med.u-tokai.ac.jp). The gEVE database provides nucleotide and amino acid sequences, genomic loci and functional annotations of EVE ORFs for all 20 genomes. In analyzing RNA-seq data with the gEVE database, we successfully identified the expressed EVE genes, suggesting that the gEVE database facilitates studies of the genomic analyses of various mammalian species. Database URL: http://geve.med.u-tokai.ac.jp


Introduction
Approximately 10% of mammalian genome sequences correspond to endogenous viral elements (EVEs), including endogenous retroviruses (ERVs), which are thought to be derived from ancient viral infections of germ cells (1)(2)(3)(4). In general, most EVEs have been inactivated by insertions, deletions, substitutions and/or epigenetic modifications. For this reason, they were once regarded solely as legacies of ancestral viral infections, and they remain unannotated even if they contain open reading frames (ORFs). However, various ORFs of EVEs are still active and express viral proteins in hosts, some of which have been found to play important roles in mammalian development. For example, proteins originally derived from envelope proteins of retroviruses, many of which are called syncytins, are known to be involved in placental development in various mammalian species (5)(6)(7)(8)(9)(10)(11)(12)(13)(14)(15)(16).
EVEs are unique in that their evolutionary histories differ among mammalian lineages. Various mammalian species have different syncytin genes that show similar molecular functions, but these genes were acquired independently in each lineage during mammalian evolution (17,18). For example, human syncytin-1 and -2 were captured in the ancestral lineages of Catarrhini and Simiiformes (6,19), respectively, and mouse syncytin-A and -B were captured in the ancestral lineage of Muridae (7). Although this unique evolution of EVEs might have contributed to maintaining the genetic basis of mammalian traits, it complicates the comprehensive discovery of functional EVEs in mammalian genomes.
At present, there are no integrated databases of EVEs. Previously, EVE (ERV) databases for human and mouse genomes were constructed as HERVd (20) and the ERE database (21), respectively. However, these databases have several problems (summarized in Table 1). For HERVd (http://herv.img.cas.cz), the reference human genome sequence is out of date, and the database is apparently not maintained, as its last update was on September 19, 2003. The ERE database is not web-based and requires Microsoft Windows. Neither database provides ORFs for each EVE sequence. Further, no computational programs for EVE detection can identify EVE ORFs comprehensively in a given genome sequence. RetroTector (22) is a well-known computer program that can identify EVE sequences in a given genome sequence, but it has been reported to be unable to identify some EVE sequences (23). RepeatMasker (24) with Repbase (25) is another well-known system for detecting EVEs. However, it was originally developed as a 'masking' tool for repetitive sequences in a given genome, and cannot annotate ORFs originating from viruses. Although there are no established programs for EVE ORF detection, a combination of these programs and databases, as well as sequence similarity searches using endogenous and exogenous viral sequences, can be used to identify comprehensive sets of EVEs in a genome.
To investigate the function and evolution of EVEs in mammalian genomes, we developed a genome-based EVE database named gEVE (http://geve.med.u-tokai.ac.jp) using 20 genomes of 19 mammalian species (Table 2). We comprehensively identified and annotated EVE ORF sequences (i) encoding >80 amino acid (aa) sequences and (ii) harboring viral sequence motifs. The sequences and annotations of all EVEs can be downloaded from the database without registration. Our new annotations of EVE ORFs will offer a useful resource which enhances studies of EVEs, such as expression analysis using next-generation sequencing (NGS) data, facilitating studies of functional EVE sequences in various mammalian species.

Statistics and annotation
The procedure used to identify sequences derived from viral infection is summarized in Figure 1. We first applied RetroTector version 1.01 (22) and RepeatMasker version 4.03 (24) with RMblast (version 2.2.28) and Repbase (25, version 20140423) to each genome sequence (Figure 1A, STEP 1). We used default parameters for each search program, except that RepeatMasker was run with the '-species' option set according to the target genome: human, mouse, rat, cow, pig, cat, dog, or mammal. For each identified candidate region, we scanned all possible codon reading frames, three in each direction (i.e. six frames). If the longest stop-codon-free reading frame in the region encoded >80 amino acids (aa), its amino acid sequence was searched using HMMER 3.1b1 (hmmer.org) with viral motif profiles, as illustrated in Figure 1A, STEP 2. Hidden Markov models (HMMs) of the viral motif profiles used in this process were downloaded from Pfam (26) (see Supplementary Table S1). Each ORF having at least one HMM profile hit was stored in the database for the corresponding genome. Note that we used an arbitrary minimum ORF cut-off of 80 aa to reduce the number of non-coding RNAs falsely extracted as EVE ORFs (28). In our annotation, ORF sequences missing a start codon (ATG) are also defined as ORFs because these sequences could work as exons in a spliced transcript.
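The six-frame scan described above can be illustrated with a minimal Python sketch. It is not the pipeline used to build gEVE: the function names are hypothetical, the standard genetic code is assumed, every stop-free stretch longer than the cut-off is reported (rather than only the longest one per candidate region), and the subsequent HMMER search is omitted. As in our annotation, a start codon is not required.

```python
# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; codons containing N (or other letters) become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_orfs(seq, min_aa=80):
    """Yield (strand, frame, peptide) for stop-free stretches longer than min_aa.

    A start codon is not required, so a stretch is simply the translated
    sequence between two stop codons (or a sequence end).
    """
    seq = seq.upper()
    for strand, template in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for peptide in translate(template[frame:]).split("*"):
                if len(peptide) > min_aa:
                    yield strand, frame, peptide
```

For a candidate region held in a string `region`, `list(six_frame_orfs(region))` returns the stretches that would then be passed to the motif search.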
Next, to retrieve EVEs that are missed by the two computational programs, we performed similarity searches using BLAT (29) against each genome (Figure 1B, STEP 3) using the following amino acid sequences: (i) all viral sequences encoding proteins stored in the NCBI RefSeq database (viral.1.protein.faa, version July 10, 2014), (ii) 131 known EVE genes (see Supplementary Table S2) and (iii) all 774,172 EVE sequences identified in STEP 2. We then summarized EVE ORF sequences with viral motifs and encoding >80 amino acids by removing overlapping sequences while accounting for reading frames (Figure 1B, STEP 4). The number of EVE ORF sequences for each gene annotation is shown in Table 2 and the gEVE database (see 'About' page).
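One possible reading of the overlap-removal step is sketched below in Python. The rule used here is an assumption made for illustration and is not taken from the gEVE pipeline: two ORFs are treated as redundant only when they lie on the same chromosome and strand, share a reading frame and overlap in position, and the longer one is kept. The function names and the tuple layout `(chrom, start, end, strand)` with inclusive coordinates are likewise hypothetical.

```python
def reading_frame(start, end, strand):
    """Frame of an ORF, taken from its 5' end on the coding strand (assumption)."""
    return start % 3 if strand == "+" else end % 3


def remove_overlaps(orfs):
    """Keep the longest ORF among same-strand, same-frame, overlapping ORFs.

    orfs: iterable of (chrom, start, end, strand) tuples with inclusive
    coordinates. The quadratic scan is adequate for a sketch but not for
    genome-scale data.
    """
    kept = []
    # Visit the longest ORFs first so that shorter duplicates are discarded.
    for chrom, start, end, strand in sorted(
        orfs, key=lambda o: o[2] - o[1], reverse=True
    ):
        frame = reading_frame(start, end, strand)
        redundant = any(
            chrom == k_chrom
            and strand == k_strand
            and frame == reading_frame(k_start, k_end, k_strand)
            and start <= k_end
            and k_start <= end
            for k_chrom, k_start, k_end, k_strand in kept
        )
        if not redundant:
            kept.append((chrom, start, end, strand))
    return sorted(kept)
```

Under this rule, ORFs that overlap in position but lie in different frames or on opposite strands are all retained.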
To further annotate each EVE ORF sequence, we conducted BLASTP searches separately against (i) all viral protein sequences (viral.1.protein.faa, version July 10, 2014), (ii) the non-redundant protein database (nr, version June 26, 2014) and (iii) known EVE sequences (see Supplementary Table S2). For each EVE gene, a description of the best hit was stored in the database. The number of best hits against all viral protein sequences for each genome is summarized in the gEVE database (see 'About' page). We also examined the correspondence between 131 known EVEs and sequences in the database (Supplementary Table S2 and 'About' page of the gEVE database). Additional annotations such as overlaps between exons of all annotated genes and our EVE sequences are provided in 'Annotation Datasheet' of the gEVE database. Detailed annotations are presented in the next section 'Service and data download'.
In the database, we employed a naming system for each EVE ORF sequence based on the genome sequence and the EVE location, using a combination of genome ID, chromosome number, 5' position, 3' position and strand (+ or -). For example, a gEVE ID of Hsap38.chr1.100259758.100261128.- indicates that the EVE ORF is located on chromosome 1 of the human genome (version GRCh38) from positions 100,259,758 to 100,261,128 (on the negative strand). With this system, all EVEs have a unique ID for each genome.
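Because the gEVE ID encodes the locus directly, it can be split back into its components. The following Python sketch shows one way to do so; the `GeveId` type and `parse_geve_id` function are hypothetical names, and the parser assumes that the genome ID itself contains no dot (the chromosome or scaffold name may).

```python
from typing import NamedTuple


class GeveId(NamedTuple):
    genome: str
    chrom: str
    start: int
    end: int
    strand: str


def parse_geve_id(geve_id):
    """Split an ID such as 'Hsap38.chr1.100259758.100261128.-' into fields."""
    # The last three dot-separated fields are always start, end and strand.
    head, start, end, strand = geve_id.rsplit(".", 3)
    # The first field is the genome ID; the remainder is the sequence name.
    genome, chrom = head.split(".", 1)
    if strand not in ("+", "-"):
        raise ValueError(f"unexpected strand in gEVE ID: {geve_id!r}")
    return GeveId(genome, chrom, int(start), int(end), strand)
```

For the example above, `parse_geve_id("Hsap38.chr1.100259758.100261128.-")` yields genome `Hsap38`, chromosome `chr1`, positions 100259758 and 100261128, and strand `-`.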

Service and data download
All EVE sequences and their annotations for the 20 mammalian genomes are available in the database. Annotation tables are displayed with optional search filters (such as species, chromosomes, amino acid lengths and HMM profiles) and can be downloaded as tab-delimited text files (Figure 2). Annotation tables include the following information: ID, gEVE ID (genome ID, chromosome, start, end and strand); Amino acid length; method, method used for EVE identification; Number of N letters, the number of Ns (undetermined nucleotides) in the region; MetORF ID, ID for EVE starting with methionine; Amino acid length of MetORF ID; HMM profile, significant motif profile(s); Viral BLAST, BLASTP best hit(s) against the NCBI Viral Genome Database (viral.1.protein.faa, version July 10, 2014); NR BLAST, BLASTP best hit(s) against the NCBI nr (non-redundant) database; EVE BLAST, BLASTP best hit(s) against known EVE sequences; RetroTector, annotation by RetroTector (22); Repbase, annotation by RepeatMasker with the Repbase database (24,25); and Overlapping, overlaps between EVE sequences and all annotated genes in the NCBI/UCSC/Ensembl databases. IDs, BLAST results and overlapping genes are linked to NCBI/UCSC/Ensembl resources depending on their contexts. Visible annotation columns can be selected using the 'Display' option (Figure 2b). Annotation search tools are also available (Figure 2c). FASTA files of nucleotide and/or amino acid sequences and annotation tables of selected EVE sequences can be downloaded via the website (Figure 2d). The bulk download of all the EVE ORF sequences and their annotations is available on the 'Download' page. Further, a BLAST search powered by SequenceServer (30) is implemented in the gEVE database so that any sequences of interest can be searched online against all sequences in the gEVE database.

Application of the gEVE database
As described in the Introduction, one of the difficulties in EVE analysis is the lack of sequence conservation among mammalian lineages. We therefore performed a phylogenetic analysis as an example application of the gEVE database (Figure 3). The human syncytin-1 amino acid sequence was used to perform BLASTP searches against all EVE sequences in the gEVE database with an e-value <1e-40. Then, a maximum likelihood phylogenetic tree was constructed using RAxML version 8 (31). We obtained syncytin-1 genes in all apes as reported by Kim and his colleagues (19), and we also found syncytin-1 like sequences in nonhominid primates, rodents and even in cows, goats, dogs and cats. Interestingly, known annotated syncytin genes in cows, goats, dogs and cats are different from these syncytin-1 like sequences. This result does not directly indicate that all these syncytin-1 like sequences are functional. However, the tree makes it easy to infer when these syncytin-1 like sequences were integrated into mammalian genomes. Phylogenetic analysis using the gEVE database can save researchers time in obtaining EVE ORFs from mammalian genomes and in selecting species for further comparative analysis.
Figure 3. Phylogenetic tree of syncytin-1 like sequences. All sequences over 400 amino acids were extracted from BLASTP hits with e-values <1e-40, and the tree was built with RAxML (31) with the substitution model (JTT + G + I) determined by ProtTest3 (32). Bootstrap values are shown on the nodes (1,000 replicates). Known syncytin-1 and -2 genes in primates are indicated by the bar on the right. External nodes show EVE IDs (see Table 2 as well).
The most powerful application of the gEVE database is in NGS analyses. We also provide a General Transfer Format (GTF) file for EVE gene loci of each genome stored in the gEVE database (see 'Download' page). Using these GTF files with NGS data, dynamic expression profiles of EVE genes can be examined. For example, the RNA-seq data of human placenta expression (ID: ERR315374) stored in the sequence read archive (SRA, http://www.ncbi.nlm.nih.gov/sra/) were examined. The FASTQ sequences were obtained and mapped onto the human genome (GRCh38) using TopHat2 (33). The expression levels of EVE sequences were computed using Cufflinks (34) with the GTF file of gEVE Hsap38. The top 10 EVE sequences showing the highest FPKM values (i.e. highly expressed EVE sequences) are summarized in Table 3. We successfully identified known EVEs expressed in human placenta, namely PEG10 (35), suppressyn (10), syncytin-1 (5) and syncytin-2 (6), as well as novel EVE sequences. This result shows that NGS data analyses combined with our annotation data enable us to discover hidden functional EVE sequences in genomes.
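The ranking step behind a table such as Table 3 can be sketched in Python as follows. This is an illustration rather than the script used in this study: the function name `top_expressed` is hypothetical, the input is assumed to be a tab-delimited table with `tracking_id` and `FPKM` columns (as in Cufflinks FPKM tracking files), and the demonstration rows are invented placeholders, not real gEVE entries or measured values.

```python
import csv
import io


def top_expressed(lines, n=10, id_col="tracking_id", fpkm_col="FPKM"):
    """Return the n (ID, FPKM) pairs with the highest FPKM values.

    lines: an open file or any iterable of tab-delimited text lines whose
    header row contains id_col and fpkm_col.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    rows = [(row[id_col], float(row[fpkm_col])) for row in reader]
    rows.sort(key=lambda pair: pair[1], reverse=True)
    return rows[:n]


# Invented placeholder table, used only to show the call.
demo = io.StringIO(
    "tracking_id\tFPKM\n"
    "eve_A\t1.5\n"
    "eve_B\t20\n"
    "eve_C\t0\n"
)
ranked = top_expressed(demo, n=2)
```

With a real file, the same function could be called as `top_expressed(open(path), n=10)`, where `path` points to the expression table produced for the gEVE GTF file.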

Future perspectives
We developed the gEVE database to provide EVE sequences coding >80 aa in the 20 mammalian genomes. In other words, our current database does not yet support non-coding sequences derived from EVEs. Accumulating reports indicate the functional importance of non-coding EVE sequences in host species, such as long terminal repeats (LTRs). Some LTRs in humans (such as LTR7) retain functional promoter-enhancer activity and control stem cell potency of embryonic stem (ES) and induced pluripotent stem (iPS) cells (36). Furthermore, various long non-coding (lnc) RNAs are expected to be derived from non-coding EVE sequences, which are also functional in host species (37). Another task for the gEVE database is to add more detailed annotation for EVE sequences. For example, the evolutionary relationships among EVE sequences in the gEVE database have not yet been examined, although the annotation of BLASTP best hits in the database would be partially useful. By addressing these points, the gEVE database will be continuously improved and expanded to contribute to the further understanding of EVE sequences in host genomes.

Supplementary Data
Supplementary data are available at Database Online.