Abstract

Gene prediction programs are known to identify coding nucleotides well, yet they remain considerably inaccurate in determining gene structural elements, most notably the exact boundaries of exons. To assess this, we earlier reviewed various programs that predict potential splice sites and exons. The results led to two observations: (i) a high proportion of false positive splice sites from computational predictions occur in the vicinity of real splice sites; and (ii) current algorithms are misled into predicting wrong splice sites more often when the coding potential ends within ±25 nucleotides of real sites than when it ends farther away. In this report, we review decision tree models for human splice sites and the resulting software tool, SpliceProximalCheck (SPC), which discriminates such ‘proximal’ false positives from real splice sites. We further present an integrated system (MZEF-SPC) in which SPC serves as a front-end tool operating on the results of Michael Zhang's Exon Finder (MZEF) program. Examination of the output of the integrated program on an illustrative gene set revealed that as many as 61 of 93 MZEF-predicted false positive exons could be eliminated by SPC at a cost of only 3 of 33 MZEF-predicted true positive exons.
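To put the reported counts in proportion, the arithmetic can be sketched as below. Only the four counts come from the abstract; the variable names and the "precision" figure (true exons as a share of all predicted exons) are illustrative assumptions, not measures reported by the authors.

```python
# Counts quoted in the abstract for the illustrative gene set.
mzef_false_exons = 93       # MZEF-predicted false positive exons
mzef_true_exons = 33        # MZEF-predicted true positive exons
false_removed_by_spc = 61   # false positives eliminated by SPC
true_lost_to_spc = 3        # true positives lost to SPC


def exon_precision(true_pos: int, false_pos: int) -> float:
    """Share of predicted exons that are real (an assumed summary metric)."""
    return true_pos / (true_pos + false_pos)


# Exons that remain after SPC filtering.
false_remaining = mzef_false_exons - false_removed_by_spc
true_remaining = mzef_true_exons - true_lost_to_spc

fraction_false_removed = false_removed_by_spc / mzef_false_exons
fraction_true_lost = true_lost_to_spc / mzef_true_exons
precision_before = exon_precision(mzef_true_exons, mzef_false_exons)
precision_after = exon_precision(true_remaining, false_remaining)

print(f"False positives removed: {fraction_false_removed:.1%}")  # 65.6%
print(f"True positives lost:     {fraction_true_lost:.1%}")      # 9.1%
print(f"Precision before SPC:    {precision_before:.1%}")        # 26.2%
print(f"Precision after SPC:     {precision_after:.1%}")         # 48.4%
```

On these counts, SPC removes roughly two thirds of the false exons while losing about one in eleven of the true ones, leaving 30 true and 32 false exons in the filtered set.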