The GLIMMER system for microbial gene identification finds ∼97–98% of all genes in a genome when compared with published annotation. This paper reports on two new results: (i) significant technical improvements to GLIMMER that improve its accuracy still further, and (ii) a comprehensive evaluation that demonstrates that the accuracy of the system is likely to be higher than previously recognized. A significant proportion of the genes missed by the system appear to be hypothetical proteins whose existence is only supported by the predictions of other programs. When the analysis is restricted to genes that have significant homology to genes in other organisms, GLIMMER misses <1% of known genes.
Accurate microbial gene identification is becoming ever more important with the increasing rate of whole genome sequencing projects. In the past year alone, eight new bacterial and archaeal genomes have appeared, and the pace continues to accelerate. Each new genome contains thousands of new genes, all of which are deposited into public databases. These genes then become the basis for much further research into the biology of these organisms, and their sequences are used for further biological study. For work such as microarray analysis, in which specific sequences are arrayed onto a substrate and used as probes to measure expression levels, the accuracy of gene predictions is critical. The same point can be made about knockout experiments, which are an important tool to use in determining the function of the large numbers of genes whose function is unknown at the time of publication. Such hypothetical proteins typically comprise 30–40% of the genes in a newly sequenced genome.
GLIMMER 1.0 is a computational gene finder that finds 97–98% of all genes in a prokaryotic genome without any human intervention (1). The system can be quickly and easily trained using only the genome sequence of interest. The technical underpinning of the system is an interpolated Markov model (IMM), a generalization of Markov chain methods. GLIMMER 1.0 has been used as the gene finder for Borrelia burgdorferi (2), Treponema pallidum (3), Chlamydia trachomatis (4) and Thermotoga maritima (5), and the software is in use at over 100 laboratories and institutes. Below we describe the algorithm and performance results of GLIMMER 2.0, a gene finder that incorporates several technical improvements to the GLIMMER 1.0 algorithm. As a result of these improvements, GLIMMER 2.0 has slightly higher sensitivity than GLIMMER 1.0 and is much better at resolving overlapping gene calls. The latter property is especially useful for genomes such as Deinococcus radiodurans, which due to their high GC-content have numerous long open reading frames (ORFs) that can easily lead to predictions of genes whose boundaries overlap incorrectly.
Methods and Algorithms
We begin by briefly reviewing Markov models in the context of DNA sequence analysis. We then describe the probabilistic model used in GLIMMER 2.0 to identify regions that are likely to be genes. We then describe how GLIMMER 2.0 resolves conflicts when overlapping genes are predicted. The complete GLIMMER 2.0 system is available from The Institute for Genomic Research at http://www.tigr.org/softlab
A Markov chain is a sequence of random variables Xi, where the probability distribution for each Xi depends only on the preceding k variables Xi-1,⋯, Xi-k, for some constant k. For DNA sequence analysis, a Markov chain models the probability of a given base b as depending only on the k bases immediately prior to b in the sequence. We refer to these preceding k bases as the context of base b in the sequence. The most common type of Markov chain is a fixed-order chain, in which the entire k-base context is used at every position. For example, a fixed 5th-order Markov chain model of DNA sequences comprises 45 = 1024 probability distributions, one for each possible 5mer context. Such fixed 5th-order models have proven effective at gene prediction in bacterial genomes (6,7).
Ideally, larger values for k are always preferable. Unfortunately, because the training data available for building models is limited, we must limit k. In most collections of DNA coding sequences, however, there is substantial variability in the frequency of occurrence of different kmers.
IMMs are a generalization of fixed-order Markov chains that combine contexts of different lengths to compute the probability of base b. Our formulation allows each context to have a weight based in part on its frequency; this allows the IMM to be sensitive to how common a particular oligomer is in a given genome. In particular, rare kmers should not be used for prediction; the IMM will ignore these in favor of shorter Markov chains. On the other hand, some long kmers may occur very frequently, and for those the IMM can give the longer context more weight and make a better prediction. These weights define an interpolated probability distribution that incorporates information from multiple Markov chains. An IMM can emulate a fixed kth-order chain simply by setting all weights to zero except for those associated with k.
Details of how to construct an IMM for sequence data have been described previously (1). For coding regions, GLIMMER 1.0 builds three separate IMMs, one for each codon position. [This is known as a 3-periodic Markov model (6).] These IMMs include 0–8th order Markov chains, as well as weights computed for every oligomer of eight bases or less that appears in the training data. These weights and Markov models are interpolated to produce a score for each base in any potential coding sequence. The logs of these scores are summed to score each coding region.
The interpolated context model
Interpolated context models (ICMs) are a further extension of IMMs. For a given context C = blb2 ⋯ bk of length k, the IMM in GLIMMER 1.0 computes a probability distribution for bk+l using as many of the bases immediately preceding bk+l as the training data set allows. The ICM is more flexible and can select any of the bases in C (not just those adjacent to bk+l) to determine the probability of bk+l. In general, from a given context, the ICM will choose approximately the same number of bases as the IMM. Our motivation for choosing bases other than those at the end of the context is the fact that in coding regions the significance of a given base depends strongly on its position in a codon; e.g. the nucleotide in the third codon position is sometimes irrelevant to the amino acid translation. The criterion employed by the ICM to select which bases of a context C to use is mutual information. The mutual information between a given pair of discrete random variables X and Y is defined to be:
To construct an ICM with context length k from a training set T of DNA sequences, we begin by considering all windows (i.e. oligomers) of length k+1 that occur in T. We let random variable X1 be the distribution of bases in the first position of those windows; X2 be the distribution of bases in the second position; and so on through Xk+1. We then calculate the mutual information values I(X1; Xk+l), I(X2; Xk+l), ⋯, I(Xk; Xk+l), and choose the maximum. Suppose that maximum is I(Xj; Xk+l). We then partition our set of windows into four subsets based on the nucleotide that occurs in position j in the window.
The same procedure can now be performed again for each of the four sets of windows. Within each set, the position that has the highest mutual information with the base at position k+1 is chosen. The four nucleotide values at that position induce a further partitioning of the current set of windows into four subsets.
This process can be viewed as constructing a tree of positions within context strings. A sample portion of such a tree is shown in Figure 1. The construction is terminated when the tree depth reaches a predetermined limit, or when the size of a set of windows becomes too small to be useful to estimate the probability of the last base position.
Each node in the ICM decomposition tree represents a set of windows that provide a probability distribution for the final base position. The root node, which includes all possible windows, represents a 0th-order Markov model. All other nodes give a probability distribution for the final base position, conditional on a specific set of bases occurring at the positions indicated on the path to the root from that node.
Note that the IMM used in GLIMMER 1.0 is a special case of this ICM, namely the case where the base chosen at each level of the tree is the last available base in the context window. Thus, when the nearest positions to base bk+l provide the strongest evidence for its value, the ICM automatically chooses them and the result is identical to the IMM. But when other bases provide stronger evidence, as is often the case, the ICM will choose them instead.
The interpolation mechanism used in the ICM is identical to that used in GLIMMER 1.0. It takes a weighted sum of two probability distributions, where the weights are determined by the number of training instances used to construct the distribution and its statistical significance as measured by a χ2 test. The only difference is that the ICM interpolation is naturally viewed as interpolating between the distributions at a parent and child node in the tree, while the IMM interpolation is always between distributions obtained using different numbers of bases at the end of the context window.
The interpolated context model presented here is essentially a probabilistic decision tree, i.e. a sparse probability distribution expressed as a decision tree. The tree construction is identical to constructing classification trees using information gain as the splitting criteria (8). Classification trees associate a class label with each leaf node of the tree. The labels in our case are the four nucleotide values, and our interpolated context model determines a probability distribution for the base to be predicted given the context in which it occurs. Probabilistic decision trees have been designed for other applications (9–11). In computational biology probabilistic decision trees have been used for modeling splice site junctions (12) and exon modeling (13).
Resolving overlapping genes
In developing GLIMMER 2.0, a conscious effort was made to reduce the number of false negative gene predictions at the expense of a slight increase in the number of false positive predictions. Upon close examination of GLIMMER 1.0S output, we learned that occasionally a gene was discarded because its start codon was positioned too far in the 5′ direction, resulting in substantial overlap with another gene. GLIMMER 2.0 solves this problem by incorporating additional rules to resolve such overlaps.
In GLIMMER 1.0, when two potential genes A and B overlap, the overlap region is scored. If A is longer than B, and if A scores higher on the overlap region, and if moving B's start site will not resolve the overlap, then B is rejected.
In GLIMMER 2.0, when potential genes A and B overlap, the overlap region is scored just as in GLIMMER 1.0. The system attempts to move the locations of the start codons much more aggressively, as follows. Suppose gene A scores higher, now four different orientations are considered:
In this case, postponing the start site of either A or B does not remove the overlap. If A is significantly longer than B (as determined by a program parameter), then B is rejected. Otherwise, both A and B are called genes, with an annotation that there was a doubtful overlap.
Only moving the start of B can resolve the overlap. If it can be moved, then it is. If not, and if B is significantly shorter than A, then B is rejected. Otherwise, both are listed as genes, with a note indicating the overlap. Moving a start codon works as follows: the system shortens the predicted gene by shifting the start location to the next available start codon. If this does not resolve the overlap, it moves the start codon again. This process continues as long as the resulting gene is longer than the minimum gene length (an easily adjustable parameter).
Only moving the start of A can resolve the overlap. Since A scores higher, we only try to move it if the overlap is a relatively small fraction of A's length. If adjusting A is not successful, B is rejected.
Both starts can move. We first move the start of B until the overlap region scores higher for B. Then we move the start of A until it scores higher. Then B again, and so on, until either the overlap is eliminated or no further moves can be made.
An additional step is taken by GLIMMER 2.0 to help find genes that previously were missed because the score from the independent probability model was too high. The independent probability model is used by both versions of the system to compete against the IMMs used to score all six reading frames; its purpose is to serve as a model of non-coding DNA. In order to be called a gene, an ORF must score higher than the independent model as well as the other five reading frames. Genes that were missed due to high scores from this independent model will fall in between the genes predicted by GLIMMER 1.0. For a target ORF in such regions, GLIMMER 2.0 considers the scores on subsequences of that ORF as compared to other overlapping ORFs. If these subsequences receive sufficiently high scores, and if the ORF scores relatively high in relation to the independent model (even though it did not exceed the normal score threshold to be called a gene), then it is added to the list of prospective genes.
The process of evaluating overlaps in GLIMMER 2.0 is performed in an iterative fashion in order to avoid rejecting genes unnecessarily. For example, in the case where ORF A causes ORF B to be rejected, and B in turn causes C to be rejected, we wish to reject only B and not both B and C. Thus, we perform the rejection phase in multiple stages, first discarding B and then checking again for overlaps.
We analyzed 10 completed microbial genomes: Haemophilus influenzile (14), Mycoplasma genitalium (15), Methanococcus jannaschii (16), Helicobacter pylori (17), Escherichia coli (18), Bacillus subtilis (19), Archaeoglobus fulgidus (20), B.burgdorferi (2), T.pallidum (3) and T.maritima (5). On each of the genomes, we ran both GLIMMER 1.0 and GLIMMER 2.0. All parameters were the defaults, although adjusting these default settings will improve performance on selected genomes. The training data was identical in every case in order to ensure a fair comparison.
The method of training was as follows: using only the genome itself as input, we extracted all ORFs longer than 500 bp from each genome. From these long ORFs, only those that did not overlap other long ORFs were retained; this produces a set of ORFs that are highly likely to be coding. (The programs to perform this extraction are included in the GLIMMER package; total runtime is <1 min on a standard desktop PC.) For all genomes in this study, this set contains more than enough data to train the system accurately.
Next, the IMM training was conducted using the original GLIMMER 1.0 program and the new, tree-structured ICMs for GLIMMER 2.0. These models were then used to identify genes in the complete genome. For all genomes, ranging in size from 0.5 to 4.7 Mb, training GLIMMER 1.0 or GLIMMER 2.0 takes <1 min on a Pentium 400 PC running the Linux operating system. The gene finding step takes an additional 1 min or less.
The results of the comparison are summarized in Tables 1–4. In all 10 genomes, there are only 12 confirmed annotated genes that GLIMMER 1.0 found that GLIMMER 2.0 did not. In all these results, we have not discounted gene predictions that fall into known ribosomal RNA or tRNA regions. Since such regions are easy to identify independently of GLIMMER, this step should be a routine part of any annotation process.
A second set of experiments was designed to find the true accuracy of GLIMMER. In the original study (1), GLIMMER 1.0's gene calls were compared to the published annotation for several completed genomes. The results of this study showed that GLIMMER 1.0 was able to find 97–98% of annotated genes fully automatically, using neither database searches nor human intervention; however, published annotation is not 100% accurate.
Therefore the question remains open as to how accurate these predictions really are. This second experiment is an attempt to answer that question more precisely.
In order to measure accuracy more precisely, we extracted a subset of genes from the published annotation for each genome. These subsets include only those genes that have significant homology to known proteins, as indicated in the published annotation. Many of these genes have a functional assignment, but some are homologous to other genes of unknown function (these are sometimes annotated as ‘conserved hypothetical’ proteins). We included the latter in the experiment because the existence of homology itself is very strong evidence that the sequence encodes a protein. Except for the use of only a subset of annotated genes, all other details of the experiments were the same as for Table 1. The results of this second comparison are summarized in Table 2.
The results make it clear that GLIMMER is more accurate on genes confirmed by sequence homology than it is on the remaining genes. For GLIMMER 1.0, sensitivity ranges from 98.4 to 99.7%, with an average of 99.1%. For GLIMMER 2.0, the range is 98.6–99.8%, with an average of 99.3%. In contrast, GLIMMER 1.0's average accuracy on the complete set of annotated genes for all 10 genomes is 98.1%, and GLIMMER 2.0's average on those genes is 98.6%.
Table 5. Genes in M.tuberculosis found automatically by GLIMMER 2.0 with homology to protein sequences from other organisms
In each of the 10 genomes, GLIMMER 2.0 found more conserved genes than GLIMMER 1.0. Usually the number was very small, only 1–5 genes for eight of the genomes. However, the set of conserved genes found by GLIMMER 2.0 was not a strict superset of those found by GLIMMER 1.0. We intersected the two sets and compared them in order to identify which genes were found by both systems and which were found exclusively by one or the other. These results are shown in Table 4. As the table shows, for each genome there are 0–4 genes found by GLIMMER 1.0 and missed by GLIMMER 2.0. There are three genomes, M.genitalium, M.jannaschii and A.fulgidus, in which all conserved genes found by GLIMMER 1.0 are found also by GLIMMER 2.0. Typically, genes found by GLIMMER 1.0 but not found by GLIMMER 2.0 are relatively short and score just below the minimum scoring threshold. For example, in B.burgdorferi the gene found by GLIMMER 1.0 and not by GLIMMER 2.0 is a 74-amino-acid ribosomal protein S14 (BB0491). The GLIMMER 2.0 score for this gene was 88, just below the default threshold value of 90. Such genes could be included in GLIMMER 2.0's predictions with suitable parameter adjustments, although at a cost of additional false-positive predictions.
In order to demonstrate that GLIMMER 2.0 has a higher sensitivity than alternative gene-finding methods, we analyzed a recently sequenced genome, Mycobacterium tuberculosis strain H37Rv (21), for which GLIMMER 2.0 was not among the computational methods used for annotation. Table 5 summarizes the genes that were found by GLIMMER 2.0 but missed in the original annotation, and that have detectable homology to a coding region from another organism. For each of the 13 genes identified, the table lists the function and identifier of the best hit found by a BLAST search. Eleven of the genes occur in intergenic regions in the published annotation of the complete genome, and the remaining two (those whose closest homologs are P17996 and Q02541) have relatively small overlaps with coding sequences annotated as hypothetical. GLIMMER 1.0 finds 11 of these 13 genes, missing those homologous to P17996 and Q02541.
It is worth noting too that the false-positive rate appears to be higher for GLIMMER 2.0, as reflected in the fact that the number of additional genes (not confirmed by database matches) predicted by GLIMMER 2.0 is higher in nine of the 10 genomes. Because of its revised rules to resolve overlapping ORFs, GLIMMER 2.0 generally makes more gene predictions than GLIMMER 1.0 when all parameters are set identically as in the above-described results. To verify that the additional annotated matches found by GLIMMER 2.0 are not attributable merely to the greater number of predictions, we compared the two systems with GLIMMER 1.0's parameters set so that the total additional gene predictions for all 10 genomes matched GLIMMER 2.0. Specifically, we raised the overlap-length parameter, which is the maximum number of DNA bases by which two ORFs can overlap and both still be predicted as genes. The results are shown in Table 6. With this adjustment GLIMMER 2.0 still finds 99 more annotated genes than GLIMMER 1.0, indicating that its predictions are in fact more accurate than GLIMMER 1.0. The parameters of either system can be adjusted to reduce the number of additional genes, at the cost of missing some true genes.
In this paper we have described several technical improvements made in the GLIMMER 2.0 gene-finding system and argued that the system is more accurate than previously recognized. GLIMMER 2.0 also can be an effective gene finder for eukaryotic genomes, especially those with a high gene density as is found in some parasites. For example, it is being used as the main gene finder for the parasite Trypanosoma brucei, the agent that causes African sleeping sickness, which currently is being sequenced at The Institute for Genomic Research. This parasite has few or no introns and a gene density estimated at 50%. The IMM scoring method in GLIMMER 1.0 has also been used to create a eukaryotic gene finder, GLIMMERM, that has been quite successful in finding genes in the genome of Plasmodium falciparum, the malaria parasite (22).
A.L.D. was supported by NSF Grant IIS-9820497. S.K. was supported by NSF Grant KDI-9980088. O.W. was supported by the Department of Energy Grant No. DE-FC02-95ER61962.A003. S.L.S. was supported by NSF Grant IIS-9902923 and NIH Grants R01 LM06845-01 and K01-HG00022-1. S.L.S. and A.L.D. were supported by NSF Grant IRI-9530462.