Genome annotation, assisted by computer programs, is one of the great advances in modern biology. Nevertheless, the in silico identification of small and complex coding sequences is still challenging. We observed that amino acid sequences inferred from coding—but rarely from non-coding—DNA sequences accumulated alignments in low-stringency BLAST searches, suggesting that this alignments accumulation could be used to highlight coding regions in sequenced DNA. To investigate this possibility, we developed a computer program (AnABlast) that generates profiles of accumulated alignments in query amino acid sequences using a low-stringency BLAST strategy. To validate this approach, all six-frame translations of DNA sequences between every two annotated exons of the fission yeast genome were analysed with AnABlast. AnABlast-generated profiles identified three new copies of known genes, and four new genes supported by experimental evidence. New pseudogenes, ancestral carboxyl- and amino-terminal subtractions, complex gene rearrangements, and ancient fragments of mitDNA and of bacterial origin, were also inferred. Thus, this novel in silico approach provides a powerful tool to uncover new genes, as well as fossil-coding sequences, thus providing insight into the evolutionary history of annotated genomes.
The development of new methods for identifying protein-coding genes and pseudogenes in genome sequences is one of the major challenges in the genomic age. In addition to the extraordinary support given by experimental approaches, computational analysis is an essential task for protein-coding gene identification in present genome biology.1,2 Current gene prediction algorithms perform a number of limited but standard steps, including identification of open reading frames (ORFs) and predicted sequence analysis to assign putative functions by similarity to known proteins, motifs, and/or structures in databases.3–9 However, even though gene prediction and annotation software are increasingly efficient, small protein-coding genes and complex intron-containing genes may still escape detection by standard in silico analysis.
Recent experimental data from ‘omic-technologies’ generated in model organisms are creating the possibility to accurately update published genome annotations.10 The fission yeast Schizosaccharomyces pombe is one of these model organisms.11 The 12.5 Mb S. pombe genome was among the first genomes that were entirely sequenced. Based on standard computational prediction, together with experimental data, 4,824 genes and 31 pseudogenes were initially annotated.12 However, while >95% of the genome is transcribed,13–15 only 55% of the genome corresponds to annotated coding sequences,12,16 raising the possibility that more protein-coding genes await to be discovered. Accordingly, high-throughput cDNA sequencing (RNA-Seq) and mass spectrometry-based proteomic approaches14,17–19 increased this initial number to the 5,123 protein-coding genes that are currently annotated in PomBase (http://www.pombase.org), the S. pombe database.20 Some pseudogenes were also reassigned as genes, leaving 29 pseudogenes annotated at present. More recently, the use of ribosome profiling (Ribo-Seq) to investigate the translational landscape of meiosis added further 46 new protein-coding genes to the fission yeast genome.18 Most of these new findings correspond to genes coding for small polypeptides and/or coding sequences lacking significant alignments in databases, highlighting that the in silico identification of this set of coding sequences is still challenging in genome annotation.
The identification of similar proteins through BLAST analysis is one of the most useful strategies in annotation tasks. Finding significant alignments facilitates assignment of putative functions to query amino acid sequences through the identification of related proteins in databases. Interestingly, non-significant alignments (below the significant threshold) are often found in conventional BLAST searches. In polypeptides derived from the electronic translation of non-coding DNA, such alignments occur by chance. In polypeptides from coding DNA, however, in addition to random matches, non-significant alignments may also underlie small amino acid patterns representing random footprints of common ancestors.21,22 Since protein-coding genes mostly arise from previous ones during evolution, even highly divergent coding sequences may harbour ancient footprints of common ancestral proteins to be found among the millions of proteins available in databases. These footprints are not present in non-coding DNA sequences.22 Thus, alignments accumulated in predicted amino acid sequences might provide a method to discern coding from non-coding DNA, a new strategy that may overcome limitations of current in silico algorithms.
To use this strategy for the identification of coding regions at genome level, we developed a computer program, named AnABlast, which generates profiles of accumulated alignments in query amino acid sequences using a low-stringency BLAST strategy. To test its potential use as a gene-finder algorithm, a genome-wide search of coding sequences was carried out in fission yeast, and Ribo-Seq data used to validate protein-coding regions were highlighted by AnABlast profiles.18 Overall, several new genes and ancient coding sequences were efficiently identified by using AnABlast in the extremely well annotated S. pombe genome. The strategy was also successfully applied in a genomic region of the nematode Caenorhabditis elegans, suggesting that this novel tool may be applicable to the in silico analysis of any other sequenced genome.
Materials and methods
We have designed a computational workflow, here named AnABlast (Ancestral-patterns search through A BLAST-based strategy), to search for putative coding sequences in genomic regions. Briefly, to search for new coding regions in a genome, all inter-exon DNA sequences of the genome (intergenic and intronic regions) were translated in all six frames, masking stop codons with ‘X’ characters; in this way, the analysis become independent of traditional ORF signals. The algorithm then executes a BLASTP search with the six resultant amino acid sequences (as BLASTX, although it can be parallelized into six threads to accelerate the process) and the UniRef50 database (2014_02 version). The UniProt Reference Clusters (UniRef) provide clustered sets of sequences from virtually all the published protein sequences,23 being the clusters from UniRef50 built with amino acid sequences having at least 50% sequence identity. AnABlast only uses the representative member from each UniRef50 cluster, and thus, it only considers non-similar sequences at 50% identity threshold (avoiding biases coming from sequence redundancy).
BLAST analysis efficiently identifies homologous sequences through highly significant alignments.3 However, because a high frequency of low significant alignments may reveal stretches of ancestral regions,21,22 we used a low-stringency BLAST strategy to include alignments beyond a significant similarity, a key step in our strategy to identify putative coding sequences with a biological and/or evolutionary significance (Fig. 1). To this end, AnABlast uses non-restrictive parameters including a low bit score threshold for considering hits. Bit score represents a normalized value that allows comparison of results from different assays. Because bit score thresholds are not allowed in BLAST tools, low bit score values (from 25) were established by using high E-values (up to 2,00,000). Using the appropriate scoring matrix during the BLAST search (among available BLOSUM and PAM standard matrices) can improve AnABlast sensitivity. In this study, the BLOSUM90 matrix was selected to favour the search of high identity sequences in very short alignments, enough to be considered protomotifs21 or ancient footprints of protein sequences.22 The low-complexity filter SEG is also used to avoid sequence repeats, thereby improving specificity. Sequence alignments were trained with coding sequences from the analysed genome (data not shown), and a bit score threshold (given by the score matrix and the search space size) that better highlighted coding regions was used. In the S. pombe genome, a threshold of 30 in the score alignment was optimal in our search strategy, corresponding to an E-value of ∼12,000 (the default E-value for searching significant hits is usually 10). Finally, up to a maximum of 10,000 alignments by query is allowed.
In a second step, AnABlast takes the positions from the acquired hits and counts the number of alignments (belonging to different clusters) that matches each position. As a result of this analysis, AnABlast yields a profile where discrete peaks highlight specific regions in the genome. Independently of the level of significance of each particular alignment, the high number of sequence alignments accumulated at a specific strand and reading frame reveals a significant region in terms of biological function and/or evolution.21,22 Under the above parameters, randomized and non-coding sequences yielded AnABlast profiles with peaks rarely reaching a maximum of 70 accumulated sequences. In this study, a cut-off threshold of 70 accumulated sequences was applied to optimize the identification of putative present and fossil-coding regions. The AnABlast algorithm is summarized in Fig. 2 and can be tested by downloading its different modules from http://www.bioinfocabd.upo.es/ab.
Analysis of the selected sequences
For all analysis, S. pombe annotations, protein and DNA sequences, and RNA expression data available from PomBase (http://www.pombase.org/) and C. elegans (https://www.wormbase.org/) on 1 February 2014 were used. The amino acid sequence delimited by each AnABlast peak was further studied manually by using BLAST3 and Pfam5 from their web applications (https://blast.ncbi.nlm.nih.gov/ and http://pfam.xfam.org/, respectively). To determine expression levels in AnABlast domains, genome-wide transcriptome Ribo-Seq data from proliferating and sporulating cells.18 Ribo-Seq and AnABlast data were visualized with the Integrated Genome Viewer.24 Genome-wide RNA-Seq data from cells subjected to temperature or nutritional stress14,19 were also used (available at http://www.pombase.org/).
Results and discussion
Setting up optimal AnABlast parameters
The significant accumulation of amino acid sequence alignments produced by AnABlast analysis may be useful to identify DNA-coding sequences of biological and/or evolutionary significance (Figs 1 and 2). To establish optimal AnABlast parameters for the accurate identification of coding sequences, we first analysed a genomic DNA region encoding known proteins. The S. pombe genome interval containing the universal protein kinase Cdc2 (encoded in the forward strand) and its flanking genes SPBC11B10.08 (forward strand), and pht1 (reverse strand) was used to this purpose. All six-frame translations of this DNA interval were obtained. To allow the complete analysis of the corresponding amino acid sequence independently of existing ORFs, stop codons were ignored (masked with ‘X’). As expected, AnABlast analysis produced a profile of discrete peaks that matched regions of annotated exons and RNA expression patterns (Fig. 3). Setting optimal thresholds in AnABlast parameters (bit score threshold in sequence alignments and cut-off threshold of accumulated sequences) is nonetheless a key question. The use of very low-stringency parameters to generate BLAST alignments (score = 28) would favour the identification of highly divergent coding sequences in our AnABlast strategy. However, the background noise caused by random matches at this permissive condition makes it difficult to discard false positives obtained by chance in non-coding sequences (Fig. 3A). On the other hand, setting of high alignment cut-off thresholds (score = 40) resulted in the efficient highlighting of only those coding regions with a significant similarity to sequences in databases, a pool of genes that are easily identified by conventional methods. By using a threshold of 30 in the score alignment, AnABlast peaks (accumulated sequences) generated in non-coding regions rarely reached the value of 70, while coding regions accumulated hundreds of sequences, yielding easily identifiable peaks (Fig. 3). Thus, a score alignment threshold of 30 in the BLAST search and a cut-off threshold of 70 in alignments accumulation were used as standard parameters for the selection of significant peaks identifying putative coding sequences.
Virtual translation of non-coding DNA generates random protein sequences. To better estimate false-positive rates under these parameters, we randomly shuffled the DNA sequence of the analysed genome interval. No significant peaks were found in 50 randomized sequences tested (see Fig. 3B), suggesting that significant peaks rarely occur by chance with the selected settings. Similarly, no significant peaks were observed when analysing the reverse 3′-5′ sequence (Fig. 3C).
Classic frameshift alignment analysis aids in the identification of frameshifts in DNA sequences.25 As AnABlast profiles identify coding sequences in the different reading frames, coding frameshifts can be also easily inferred from this computational analysis (see Cdc2 exons in Fig. 3A).
AnABlast analysis of inter-exon regions in the entire S. pombe genome
To perform a genome-wide search of new coding regions in fission yeast, we analysed all nucleotide sequences between every two adjacent annotated exons, accounting for a total of 5,870 intergenic regions and 4,782 introns. AnABlast profiles were generated for each of the six-frame translation sequences of these inter-exon regions. Among them, 158 regions contained at least one significant peak and were selected for further analysis. As a negative control, we randomized the corresponding DNA sequence in each of the analysed inter-exon regions. With this data set, only 19 significant peaks were obtained (AnABlast profiles obtained both in intergenic sequences, and its randomized versions are available at http://www.bioinfocabd.upo.es/anablast/supplementary_data.html). Thus, on average, >85% of the selected AnABlast peaks are expected to identify true-positive coding sequences.
Interestingly, all the selected profiles, except one, were located in intergenic regions, suggesting that in general, S. pombe introns do not usually contain coding sequences. The only intronic peak was located in the first intron of the SPAC22F3.11c gene, snu23 (Fig. 4A). RNA-Seq profiles, the identification of zinc-finger motifs, and its high identity to Schizosaccharomyces sps genes (Supplementary Table S1A) indicate that the sequence identified by AnABlast in this annotated intron is in fact a coding sequence of the snu23 gene (Fig. 4A). A similar suggestion was made from S. pombe transcriptome analysis.19
AnABlast peaks highlighting pseudogenes
Annotated transposons and pseudogenes were not excluded for intergenic regions analysis. Accordingly, a number of transposons and all 29 annotated pseudogenes were detected in AnABlast profiles (see examples in Supplementary Fig. S1). In addition, AnABlast peaks uncovered a new pseudogene (Fig. 4B). Reading frames in this pseudogene contain numerous stop codons and at least five frameshifts, a complex structure that makes its identification by conventional methods difficult (see Fig. 4B). BLAST analysis of predicted amino acid sequences indicated that this new pseudogene encodes a putative dipeptide transmembrane transporter protein (Supplementary Table S1A). A second new pseudogene uncovered within rearranged DNA regions will be described later.
Detailed analysis of AnABlast profiles generated in annotated pseudogenes suggested a number of corrections to their genomic annotations. Two additional coding domains (in different reading frames) flanking the SPBPB10D8.03 pseudogene suggest carboxyl- and amino-terminal expansions in the annotated region (Fig. 4C and Supplementary Table S1A). Another correction is suggested in the two annotated exons of the SPAPB24D3.05c pseudogene. AnABlast also identified two regions separated by a single frameshift, although region intervals and the reading frame for the first region differ from those in the annotated exons (Fig. 4D and Supplementary Table S1A). Aside corrections, AnABlast peaks matching Ribo-Seq data profiles in pseudogenes suggest that at least two of them could be reassigned as coding genes (Supplementary Fig. S2).
From these results, we conclude that our in silico strategy efficiently highlights coding sequences in actual genes (Fig. 3) as well as in pseudogenes (Fig. 4). Since part of the pseudogenization process includes loss of initiating methionine, nonsense, and frameshift mutations, AnABlast might be useful to identify new genes, but also evolutionary degenerated coding sequences that escape to conventional searches.
Identification of new genes
The S. pombe genome is extremely well annotated.16,20 Thus, a stringent test for AnABlast is the identification of new genes in the genome of this model organism. To this end, we initially looked for AnABlast peaks containing ORFs, and 18 of them were selected.
Pfam and BLAST searches found that three of the selected ORFs uncovered new copies of annotated genes (Supplementary Fig. S3). Importantly, four other ORF-containing AnABlast peaks identified putative new protein-coding genes, experimentally validated by Ribo-Seq expression patterns18 (Fig. 5 and Supplementary Table S1B).
The first novel gene partially overlaps the predicted poly(A)-bearing non-coding RNA SPNCRNA.31. Both the AnABlast ORF (42 amino acid length) and the annotated non-coding RNA are found in the forward strand, with the Ribo-Seq profile confirming the translation of the ORF (Fig. 5A).
The second new gene contains a short 28 residues ORF that is highly expressed (Fig. 5B). The corresponding AnABlast peak further extends the carboxyl-terminal end of this ORF (see in Fig. 5B), suggesting that this extremely short gene could be originated through a carboxyl-terminal subtraction from a longer ancient one.26 Polypeptides encoded by these two new genes lack significant homologs or motifs in databases.
The other two AnABlast ORFs, in addition to being translated, encode polypeptides that share significant similarity to known proteins. A main AnABlast peak, flanked by two others in different reading frames, identifies one of these new genes (see in Fig. 5C). Concatenated amino acid sequences from the three peaks render a 178-residue protein with a homologue in Schizosaccharomyces octosporus and eight in Schizosaccharomyces cryophilus, ranging in length from 102 to 190 residues (Supplementary Table S1B). RNA expression profile (see Ribo-Seq pattern in Fig. 5C) and its homology to the short EPY50361.1 S. cryophilus variant (see alignments in Fig. 5C) suggest that only the 56 residues ORF encoded by the first AnABlast peak is an active gene. Thus, the proposed S. pombe gene (and its equivalent in Schizosaccharomyces cryophilus) likely resulted from premature nonsense mutations, generating a fossil carboxyl-terminal sequence that in the remaining related proteins is still active.
Another AnABlast ORF encodes a 77-residue polypeptide sharing 29% identity (52% similarity) along the entire sequence of the WP_002758301.1 protein from Microcystis aeruginosa, an evolutionary distant freshwater cyanobacteria (Fig. 5D). Identity increases to 35% with homolog proteins found in filamentous fungi, but homology length decreases to the first 52 residues in these closer microorganisms. Identity with homolog sequences found in S. cryophilus and S. octosporus rises to 76% but is limited to 31 amino-terminal residues. Since RNA expression is also observed at this 31 amino acids interval, we conclude that this short region is an active coding sequence in fission yeasts. According to BLAST alignments (Fig. 5D), premature nonsense mutations could also result in the progressive carboxyl-terminal subtractions of this coding sequence over evolution.26 However, in S. pombe, no stop codon is found at the 3′ end position of the DNA sequence coding for the 31-residue domain (see in Supplementary Table S1). In fact, RNA-Seq data suggest the existence of at least two exons in this region (see Fig. 5D). Thus, in this particular case, subtraction likely raised in the S. pombe genome by the creation of a new splice site.27,28
We cannot rule out the existence of new genes in the remaining ORFs found by AnABlast search. However, neither significant translation levels nor sequence/motif conservation support their actual gene status (see examples in Supplementary Fig. S4).
The in silico identification of new genes is particularly useful for genome annotations. To test the potential use of AnABlast as a gene-finger tool in other organisms, we analysed the Chromosome II:0-5 Mb interval of the C. elegans genome. Annotated ORFs were efficiently highlighted in AnABlast profiles using the settings trained for the S. pombe genome. Furthermore, a detailed analysis of intergenic regions in this interval revealed that AnABlast profiles can also be useful to identify new genes in the nematode genome (Supplementary Fig. S5).
Ancestral gene subtractions
As shown above, our coding sequences search strategy can both find new genes and uncover their associated evolutionary history, revealing fossil-coding sequences that have being subtracted by changes in stop positions (Figs 5B and 3C) or in splice sites (Fig. 5D). To locate possible fossil coding sequences subtracted from present genes, we examined peaks located at the edge (exon boundaries) of AnABlast profiles at each inter-exon region. Peaks revealing amino acid sequences encoded in the same strand as adjacent exons were selected (putative ancient subtractions), and their corresponding amino acid sequences concatenated ignoring reading frameshifts and stop codons. Interestingly, BLAST searches identified two genes in which alignments to homologous proteins extended along the concatenated peptides, one by expanding the 5′ end coding DNA sequence and the other through a 3′ end extension (Fig. 6A and 4B, and Supplementary Table S1C).
Evolutionary changes in coding sequence length seem to be dynamic processes via shifts in the position of the start29 and the stop codons.26 Our findings indicate that AnABlast may be useful to identify subtractions occurred during evolution. Remarkably, AnABlast can identify long terminal changes which, according to previous studies,26,29 represent rare events among terminal subtractions and expansions. Since evolutionary expansions are part of present genes, AnABlast analysis of intergenic regions carried out in this study cannot uncover these evolutionary events.
Identification of gene fragments
The vast majority of the selected AnABlast profiles identified single peak regions without putative ORFs (63%, excluding pseudogenes and transposons). Since AnABlast identifies coding sequences independently of ORFs, most of these peaks could represent gene fragments arisen from mutations and/or genome rearrangements. Accordingly, careful BLAST and Pfam analysis of predicted amino acid sequences in some of these peaks revealed fragments of tf1-retrotransposon (Fig. 7A), of a MIP water channel protein (Fig. 7B), and of an amino acid permease (Fig. 7C). Interestingly, this later peak includes a possible fossil fragment of a bacterial transposase, fused to the amino acid permease sequence in the same reading frame (Supplementary Table S1D). This finding casts light on how random combinatorial protein domains, including exogenous DNA, may contribute to the emergence of new traits during evolution.
Our search strategy also identified rearranged fragments of the same gene. Two adjacent genomic intervals coding for partial sequences of a predicted surface glycoprotein (in different reading frames) (Fig. 7D), or up to three different subtelomeric regions encoding adjacent fragments of the RecA protein (encoded by the tlh1/tlh2 genes) provide good examples of rearranged genes. As shown by AnABlast profiles, at least two regions containing tlh1/tlh2 DNA fragments located in subtelomeric regions of Chromosomes I and II, respectively, harbour identical rearranged structures (see Fig. 7E and F). A fragment of the SPAC212.04c gene (DUF999 family protein 1) can also be observed in both subtelomeric intervals (see in Fig. 7), albeit only in the case of Chromosome II this gene fragment is annotated as pseudogene (SPBCPT2R1.05c). According to AnABlast peaks, this pseudogene should be also annotated as new pseudogene in the equivalent Chromosome I subtelomeric region.
Interestingly, the identical genomic gene fragments pattern revealed by AnABlast profiles in both subtelomeric intervals suggests that evolutionary events yielding such fragments likely occurred earlier than their translocations to the subtelomeric region. However, we cannot rule out that occasionally, these AnaBlast results could point to possible genome assembly errors rather than rearranged fragments and/or missing annotations.
As described above, BLAST analysis of a particular AnABlast peak identified a possible fossil transposase sequence, perhaps of bacterial origin (see in Fig. 7C). This observation led us to investigate whether other AnABlast peaks could identify fossil sequences30 transferred from other genomes.
Importantly, AnABlast identified a coding sequence containing a protein domain of the mitochondrial maturase protein-2 (SPMIT.03 mitochondrial gene), indicating that this genome fragment has a mitochondrial origin (Fig. 8A). Furthermore, additional transposase and trehalose synthase fragments (Fig. 8B and C, respectively) point to a possible horizontal transfer from bacteria as an evolutionary source for these intergenic sequences.31 A short region encoding amino acid sequences highly similar to cytochrome c oxidase subunit III was also identified (Fig. 8D).
Here we report that low-stringency alignments accumulation in predicted amino acid sequences may discern coding from non-coding DNA, thereby providing a new in silico strategy (AnABlast) to reveal present and ancient coding regions in sequenced DNA. We have shown that the identification of rearranged and fossil sequences by AnABlast may be helpful to better understand the evolutionary history of the S. pombe genome. This property, together with the ability to uncover novel genes and pseudogenes, suggests that AnABlast provides a new strategy of general interest for the genome-wide search of present and fossil sequences in any genome.
J.J. and A.P.P. conceived the study and designed the experiments. A.P.P. developed the computer programme. A.P.P. and M.G. generated AnABlast profiles. C.D.S.D. and J.M. performed Ribo-Seq experiments. J.J. did the analyses of AnABlast peaks and wrote the manuscript. J.J., C.D.S.D., M.G., J.M., and A.P.P. critically read and approved the manuscript.
This research was supported by the Ministry of Economy and Competitiveness of the Spanish Government grant BFU2013-46923-P (J.J.) and by Biotechnology and Biological Sciences Research Council (UK) research grant BB/J007153/1 (J.M.). Authors declare no competing interest. Funding to pay the Open Access publication charges for this article was provided by the Ministry of Economy and Competitiveness of the Spanish Government.
We thank Andrés Garzón, Manuel Muñoz. and Rafael R. Daga for comments on the manuscript.