-
PDF
- Split View
-
Views
-
Cite
Cite
Jelle J. Goeman, Ulrich Mansmann, Multiple testing on the directed acyclic graph of gene ontology, Bioinformatics, Volume 24, Issue 4, 15 February 2008, Pages 537–544, https://doi.org/10.1093/bioinformatics/btm628
Close -
Share
Abstract
Motivation: Current methods for multiplicity adjustment do not make use of the graph structure of Gene Ontology (GO) when testing for association of expression profiles of GO terms with a response variable.
Results: We propose a multiple testing method, called the focus level procedure, that preserves the graph structure of Gene Ontology (GO). The procedure is constructed as a combination of a Closed Testing procedure with Holm's method. It requires a user to choose a ‘focus level’ in the GO graph, which reflects the level of specificity of terms in which the user is most interested. This choice also determines the level in the GO graph at which the procedure has most power. We prove that the procedure strongly controls the family-wise error rate without any additional assumptions on the joint distribution of the test statistics used. We also present an algorithm to calculate multiplicity-adjusted P-values. Because the focus level procedure preserves the structure of the GO graph, it does not generally preserve the ordering of the raw P-values in the adjusted P-values.
Availability: The focus level procedure has been implemented in the globaltest and GlobalAncova packages, both of which are available on www.bioconductor.org.
Contact:j.j.goeman@lumc.nl
Supplementary information: Supplementary data are available at Bioinformatics online.
1 INTRODUCTION
In microarray data analysis, there has been a shift of interest towards methods that analyze genes in the context of their annotation. Annotation allows genes with similar function or location to be grouped and analyzed together based on information from sources such as KEGG (Ogata et al., 1999) or Gene Ontology (Ashburner et al., 2000).
In recent years several statistical methods have been proposed for analyzing sets of genes with similar annotation. Most of these methods start from univariate measures of differential expression (usually P-values) obtained from single gene analyses, using these to test for over-representation (‘enrichment’) of differentially expressed genes in the gene sets of interest (Khatri and Drăghici, 2005; Mootha et al., 2003). Other methods start from the raw data and test for differential gene expression profiles over the gene set in a multivariate way without first going through single gene testing (Dinu et al., 2007; Goeman et al., 2004; Mansmann and Meister, 2005; Tomfohr et al., 2005).
The most popular source of annotation used in this context is the Gene Ontology (GO) database. GO is popular because it is not only large and detailed, but also highly structured: the GO terms are ordered in a directed acyclic graph, in which the set of genes annotated to a certain term (node) is a subset of those annotated to its parent nodes. Some authors (Alexa et al., 2006; Grossmann et al., 2006) have proposed to take into account the GO graph structure when testing for enrichment.
In this article we propose a way to exploit the GO graph structure for multiple testing correction. The structure of the GO graph complicates the application of multiple testing procedures. On the one hand, overlap between sets of genes and correlations between genes lead to highly complicated and unknown dependency structures between test statistics. This means that methods that make assumptions on the joint distribution of the test statistics, such as Westfall and Young (1993) and others (Dudoit et al., 2003), should be used with caution, while methods for which the validity is assured under any dependency structure, such as the procedures of Bonferroni or Holm (1979), can be overly conservative.
The structure of the GO graph also presents special opportunities, however, as this structure induces logical relationships between the null hypotheses, which can be exploited (Shaffer, 1986). However, at this moment no currently available method makes use of the GO structure for multiple testing. Meinshausen (2007) and Yekutieli (2007) have suggested useful methods, but both of these are for trees only and do not work in a directed acyclic graph.
In this article we present a sequentially rejective multiple testing method that strongly controls the family-wise error rate (Dudoit et al., 2003) while explicitly making use of the directed acyclic graph structure of GO. This focus level procedure merges Holm's (1979) procedure with the closed testing procedure of Marcus et al. (1976). Like the two procedures it combines, the new procedure makes no assumptions on the joint distribution of the test statistics.
Because the focus level procedure follows the structure of the GO graph, the collection of rejected null hypotheses is always a coherent subgraph of the GO graph, growing from the root node up to the most specific nodes that can be called significant. This coherence greatly facilitates interpretation. However, a consequence of the preservation of the graph structure is that the ordering of the raw P-values is not preserved in the multiplicity-adjusted P-values. It may even occur that some multiplicity-adjusted P-values are smaller than their corresponding unadjusted P-values.
The new procedure will be illustrated with two applications using the biological process GO graph. One uses the Global Test method on a microarray data set of 295 breast cancer patients from the Netherlands Cancer Institute (Van de Vijver et al., 2002). A second example, available in the Supplemental information, uses the Global Ancova method on a data set of 238 AML patients from the University Medical Center Grosshadern in Munich (Schnittger et al., 2005).
The focus level procedure has been implemented in the R packages globaltest and GlobalAncova. Both packages are available on www.bioconductor.org.
2 TESTING IN THE GO GRAPH
GO (Ashburner et al., 2000) is a controlled vocabulary that describes gene and gene product attributes. It can be represented as three disjoint directed acyclic graphs describing biological process, molecular function and cellular component. Each has a single root and thousands of dependent nodes. The three graphs are sometimes joined together by an overall root node called gene ontology, but in this article we consider GO as three separate graphs and speak of a GO graph. We do not make a distinction between the two types of relationships in the GO graph (‘is a’ and ‘part of’). When talking of the GO graph we visualize the root node of the graph as the ‘top’ of the graph and the end nodes as the ‘bottom‘.
For the purposes of this article, we will look at a GO graph as a structured collection of sets of genes. Genes are annotated to GO terms, so that each node in the gene ontology graph can be associated with a set of genes. The structure of the graph dictates that each gene annotated to a certain node is automatically also annotated to all ancestor nodes, so that, viewed in terms of sets, every edge in the graph represents a subset relationship: each child node is a subset of all its parent nodes, and the root node in each GO graph includes the superset of all nodes. Note, however, that not all subset relationships necessarily correspond to an edge in the GO graph, because genes may have multiple functions.
Every gene set has a corresponding null hypothesis that can be tested. We assume that these null hypotheses are formulated in such a way that they reflect the relationships in the GO graph. In general, we let any subset relationship between gene sets imply a logical relationship between the corresponding null hypotheses. Let A, B and C be gene sets, and HA, HB and HC the corresponding null hypotheses, then we assume that Note that the null hypothesis corresponding to the set of all genes on the chip is false in this setup whenever any null hypothesis is false.
If B ⊆ A, then truth of HA implies truth of HB
If A = B ∪ C, then simultaneous truth of HB and HC implies truth of HA
The assumptions hold for the multivariate gene set testing methods of Global Test (Goeman et al., 2004, 2005), Global Ancova (Hummel et al., 2007; Mansmann and Meister, 2005), PLAGE (Tomfohr et al., 2005) and SAM-GS (Dinu et al., 2007) which test hypotheses of the form (1). Methods that test enrichment of GO nodes (Khatri and Drăghici, 2005; Mootha et al., 2003) are typically not consistent with these assumptions. See Tian et al. (2005) and Goeman and Bühlmann (2007) for an overview of different null hypotheses used in gene set testing.
It should be remarked that the logical structure that holds for the null hypotheses generally should not be expected to hold for the raw, non-multiplicity adjusted test results. As the power of the tests depends on the gene set, it may, for example, easily occur in practice that HB is significant but HA is not, while B ⊆ A.
3 BOTTOM-UP AND TOP-DOWN PROCEDURES
The focus level procedure we propose for the GO graph is a combination of the procedure of Holm (1979) and the closed testing procedure of Marcus et al. (1976). It is actually not a single procedure, but a sequence of procedures that depends on a user-chosen focus level. At the two extreme choices of the focus level, the procedure is either bottom-up, and purely based on Holm, or top-down, and purely based on closed testing. We'll discuss these two extremes first. Let α be the chosen significance level.
A bottom-up procedure based on Holm would first look only at the collection of all end node null hypotheses of the GO graph. If this collection has m elements, it compares the smallest P-value to α/m. If the P-value is smaller, the corresponding null hypothesis is declared significant, and the second smallest P-value is compared to α/(m − 1). The procedure continues until the kth smallest P-value is larger than α/(m − k + 1). At this point, all m − k + 1 largest P-values are declared non-significant. At this point, significance can be propagated to all ancestors in the GO-graph: a higher level GO node can be declared significant whenever it has any significant offspring.
It is easy to show that this procedure keeps strong control of the family-wise error rate at level α. By the proof of Holm's procedure (Holm, 1979) the family-wise error rate is controlled for the collection of end node null hypotheses. Further, by the logical structure of the null hypotheses, the upward propagation of significance through the GO-graph can only lead to a false rejection if there was already a false rejection in one of the end nodes. Hence, it can never introduce a first false rejection and so does not increase the probability of making a first false rejection. For a formal proof, see Section 5.
An advantage of this bottom-up procedure is that it is quick, as the number of tests that has to be preformed is smaller than the number of nodes in the GO graph. A disadvantage is that it is not very efficient. The GO graph fans out like a tree and has a great number of end nodes, so that the multiple testing correction can still be quite severe. The bottom-up procedure, being focused at the leaves of the graph, may easily find a strongly significant effect in a few genes in one of the end nodes, but it may just as easily miss a weaker but more consistent effect of a large number of genes in a gene set at a higher level in the graph.
An alternative to the bottom-up procedure is a top-down procedure based on the closed testing procedure of Marcus, Peritz and Gabriel. To be able to use this procedure we must enlarge the graph to be tested so that the collection of gene sets in the nodes becomes closed to the union operation. That is, for any gene sets A and B corresponding to nodes in the graph, a node corresponding to A ∪ B must also be included, and this process must be continued recursively until all possible (multiple) unions are included. New edges are added in the expanded graph for every direct subset relationship. This closing of the graph can dramatically increase the number of hypotheses to be tested if the original number of hypotheses was already large, as the size of the expanded graph increases exponentially with the size of the original graph. We illustrate the closing of the GO graph in Figure 1.
Illustration of the expansion (‘closing’) of a graph for use in a Closed Testing procedure. Left: original graph. Right: expanded graph.
Illustration of the expansion (‘closing’) of a graph for use in a Closed Testing procedure. Left: original graph. Right: expanded graph.
In the first step the top down closed testing procedure tests only the top node, which is the union of all GO gene sets, at level α. If that test is not significant, the procedure stops. Otherwise, at every next step it tests those gene sets of which all supersets (i.e. all ancestors in the expanded graph) have been declared significant, and when there are no such sets, the procedure stops. At this point all gene sets which have not been tested yet are declared non-significant. All tests in the closed testing procedure can be performed at level α, but the procedure still keeps strong control of the family-wise error rate at level α, as has been very elegantly proved by Marcus et al. (1976).
The property that all tests are done at level α makes the closed testing procedure very attractive. It is a very efficient multiple testing procedure if the test statistics are highly correlated, which makes it especially useful for the GO graph. However, a closed testing procedure can easily miss a strong but isolated effect in one of the end nodes in a data set where very few genes show an effect: to be called significant, the effect of such a small end node must remain significant when it is watered down in all possible unions of the significant set with all possible other sets. Moreover, unless the initial number of nodes to be tested is very small, the computational burden can be crippling due to the huge size of the expanded graph. For example, in the breast cancer data set used in Section 7 the biological process GO graph with 2865 non-empty nodes would expand to a closed graph with around 8.5 × 10270 nodes.
The bottom-up and top-down multiple testing procedures are mirror images of each other in many ways, and they have opposite strengths. The bottom-up method is good at finding local ‘hotspot’ effects, where only a single end node is highly significant, even when most of the other nodes are non-significant. Conversely, the top-down method is good at finding weak global effects, where many genes show a small effect.
In the top-down method significance of GO terms low down is greatly influenced by the (lack of) significance of large gene sets near the top of the graph: a small node with a highly significant raw P-value may fail to be significant in the presence of a large non-significant node. Conversely, in the bottom-up method significance of GO terms near the top of the graph is greatly influenced by the presence of highly significant end nodes: a large node with a significant raw P-value may fail to be significant if none of its offspring terms individually reaches the Bonferroni threshold.
4 THE FOCUS LEVEL METHOD
The focus level method strikes a balance between the two extremes described above. The basic idea is very simple: instead of starting at the top or at the bottom of the graph, we start somewhere the middle, at a pre-chosen level in the GO graph called the focus level. We then use the bottom-up method to extend significance to higher levels and the top-down method to extend it to lower ones.
In the most general form a focus level is simply a collection of GO terms. However, for an efficient procedure it is recommended that the focus level truly represents a level in the graph, meaning that no term in the focus level has either ancestors or offspring which are also in the focus level. It is also recommended that all other GO nodes of interest are either ancestors or offspring (or at least ancestors of offspring) of focus level nodes, because otherwise the focus level procedure will not reach all nodes of interest. A third consideration is computation time, as discussed in Section 6.
The full GO graph
can now be split at the focus level into m subgraphs
1, … ,
m, each with a single focus level node as root node and consisting of all its offspring nodes. Note that the subgraphs may partly overlap. Each of these subgraphs will be expanded separately in the same way as in the closed testing procedure (Section 3). Call these expanded subgraphs
.
We also define an expanded scope graph
. Its nodes are given by the union of all nodes in the expanded graphs
. It gets a new set of edges which correspond to all subset relationships:
is a parent of
if the corresponding gene sets R of r and S of s fulfil R ⊇ S.
The focus level multiple testing procedure revolves around a collection ℋ ⊂
of testable hypotheses. At the start of the procedure ℋ is the collection of focus level hypotheses. The focus level procedure also uses a multiple testing correction factor h, which we call Holm's factor. It starts as m, the number of elements in ℋ.
The steps of the focus level procedure are outlined in Table 1.
The focus level algorithm
| Repeat |
|
| Until no new rejections occur |
| Repeat |
|
| Until no new rejections occur |
The focus level algorithm
| Repeat |
|
| Until no new rejections occur |
| Repeat |
|
| Until no new rejections occur |
This procedure keeps the family-wise error rate at level α, as we shall prove in the Section 5. The algorithm is presented in its most simple form here. Several adjustments can be made to increase computation speed; we consider these in Section 6.
It is easy to adapt the algorithm to let it produce multiplicity-adjusted P-values. A multiplicity adjusted P-value (Dudoit et al., 2003) for a hypothesis is defined as the smallest α-level at which that hypothesis will be called significant. For the focus level procedure these can be calculated as follows. Simply start at α = 0 and run the algorithm. Amend the algorithm so that every time no (more) rejection can be made, it does not stop, but increases α to the next value that allows a new rejection, and continues. All rejected hypotheses get an adjusted P-value equal to the α-level at which they were rejected. Stop the algorithm only when α reaches 1, at which point all unrejected hypotheses get adjusted P-value 1.
The focus level procedure is a true generalization of the top-down and bottom-up procedures described in Section 3. If the root node is taken as the focus level, the focus level procedure is exactly the top-down closed testing procedure (nothing happens in steps 2 and 4). Similarly, if the collection of all end nodes is taken as the focus level, the focus level procedure is exactly the bottom-up procedure (nothing happens in step 3).
The focus level procedure presents a generalized procedure whose properties are between the bottom-up and top-down procedures. Just as the top-down procedure has good power for detecting weak global effects and the bottom-up procedure has good power for detecting strong local effects, the focus level procedure has good power for detecting intermediate effects near the focus level. In the top-down procedure the significance of end nodes is influenced by significance of nodes near the root of the graph, and in the bottom-up procedure the significance of nodes near the root is influenced by significance of end nodes. Similarly, in the focus level procedure significance of nodes far from the focus level is influenced by the significance of nodes at the focus level. The choice of the focus level should therefore be considered carefully: the focus level should ideally be chosen at the level of specificity that is of most interest to the researcher, taking into account the restrictions of computability (see Section 6).
The significant subgraph returned by the focus level procedure is always a coherent graph, because step 2 of the algorithm guarantees that there is a path from any significant node through the significant subgraph to the root node. In this sense the result of the procedure reflects Assumption 1 of Section 2: that assumption on truth and falsehood of null hypotheses also holds for rejection and non-rejection. Similarly, reflecting Assumption 2, a significant node will often have at least one offspring node that is significant. This does not always happen, however. In the first place, it may be that the significance of the higher level node is due to genes which are annotated to the higher level GO term but not to any of its offspring terms. Alternatively, it may be that the significant signal of the higher level node is due to the combination of weak signals from its offspring nodes, none of which is strong enough to be detected by itself. In both of these cases the interpretation of the result should be that there is a clear signal in the general GO term, which cannot be safely attributed to any specific offspring term.
It is interesting to note that the focus level algorithm (just as the bottom-up procedure) can produce multiplicity-adjusted P-values which are smaller than the corresponding raw P-values. This can only happen to terms above the focus level terms, and occurs in step 2, where hypotheses are rejected based on the P-values of their offspring, without reference to the P-value of the hypothesis itself. An example of this effect will be shown in Section 7. If this property of the procedure is felt to be undesirable for interpretational reasons, the algorithm can be easily amended to only reject in step 2 if the raw P-value of the ancestor hypotheses is smaller than α. However, this may result in a significant graph that lacks coherence.
5 PROOF OF FWER CONTROL
As the focus level procedure merges Holm's procedure with the closed testing procedure, the proof that the new procedure controls the Family-wise error rate essentially merges the two elegant proofs of Marcus et al. (1976) and Holm (1979).
i, i = 1, … , n, let
i be the collection of nodes in
i that correspond to true null hypotheses. As each node corresponds to a set of genes, we can represent
i as {Ti,1, … ,Ti,ni}, a collection of ni gene sets corresponding to true null hypotheses in
i. Let Let k = #{i : Ti ≠ ∅}, the number of subgraphs containing true null hypotheses, and let i1, … , ik be the indices i such that Ti ≠ ∅. Let H1, … , Hk be the null hypotheses corresponding to Ti1, … ,Tik. Note that by assumption 2 in Section 2H1, … , Hk are themselves true null hypotheses.The first true null hypothesis that is rejected is one of H1, … , Hk. This must happen during step 1 of the algorithm.
If all H1, … , Hk have raw P-values greater than α/k , none of them will be rejected.
To prove 1, first remark that by Assumption 1 in Section 2 any offspring in
of a true hypothesis is itself a true hypothesis. Therefore, step 2 can never lead to a first false rejection: if the rejected ancestor in step 2 was a true null, its offspring rejected in step 1 was already a true null itself.
The first false rejection must therefore take place in step 1. To show that the first false rejection in step 1 is always from among H1, … , Hk, remark that by the construction of the closed graph
the null hypothesis Hi is contained in the graph
, and that by the construction of Ti every other true null hypothesis in that graph is offspring of Hi in that graph. Therefore no true null other than those in H1, … , Hk can be added to ℋ in step 3 before at least one H1, … , Hk has been rejected. Therefore, as long as no false rejection has been made, ℋ will only contain true null hypotheses from among H1, … , Hk.
To prove 2, remark that if all p1, … , pk are larger than α/k, none of H1, … , Hk will be rejected before the Holm's factor h drops below k. But the Holm's factor will not drop below k before all but k − 1 subtrees
have been fully rejected, which cannot happen as long as H1, … , Hk remain unrejected.
6 COMPUTATIONAL ISSUES
In a practical implementation of the focus level algorithm in a large graph, efforts must be made to reduce the computational burden. In this section we describe some amendments to the algorithm that can be used to reduce this burden, although sometimes at a small loss of power.
The greatest bottleneck of the algorithm is the potentially enormous size of the expanded scope graph
, which can become especially large if some of the focus level nodes have a large number of offspring nodes. To keep
small, it is, therefore, important to keep the expanded subgraphs
small. This can be done for each subgraph by finding a small number of atom sets that allow construction of all offspring sets of the focus level node as unions of these atoms. Using these atoms it is possible to avoid redundant sets when making the closed graphs as in Figure 1, while still including all the original gene sets in the graph. For example, in the data set of Section 7, a closed graph of all 2865 biological process GO terms would contain 22865 nodes, while reducing the 2865 sets to 900 atoms leads to a closed graph of 2900 nodes which still includes all 2865 original sets.
We construct atoms for each subgraph as follows. Start with the a collection
of gene sets. the algorithm is given in Table 2. Note that the atom construction is not necessarily unique and that a more sophisticated algorithm may be able to find a smaller closed graph.
Finding atoms
| Foriin 1, … , k |
|
| Foriin 1, … , k |
|
Finding atoms
| Foriin 1, … , k |
|
| Foriin 1, … , k |
|
It is important to keep the number of atoms small, because the number of nodes in the final closed graph
is 2#atoms. We recommend not choosing any focus level nodes whose subgraphs contain more than 9–12 atoms (≈500–4000 subgraph nodes).
Computation speed can also be gained by avoiding the laborious construction of the whole graph
, most of which will never be reached by the algorithm. We can avoid construction of
by creating each expanded graph
only when it is needed, that is when the ith focus term is declared significant. Avoiding construction of
may entail some loss of power in step 2 of the algorithm, because in that step only known subset relationships from the original GO graph
and from already expanded subgraphs can be exploited for rejection.
These measures allow a good reduction of computation time. Note that due to the stepwise nature of the algorithm, finding only the smallest few adjusted P-values is much quicker than finding all adjusted P-values. In the current R implementation for Global Test, calculating all 2865 adjusted P-values of the Biological process GO graph for the Netherlands Cancer Institute data (see Section 7) takes about 42 min on a 3.2 GHz PC. Calculating only the 100 smallest multiplicity-adjusted P-values takes just 5 min.
7 APPLICATION: BREAST CANCER
We have applied the focus level procedure to the Netherlands Cancer Institute Breast Cancer Data (Van deVijver et al., 2002). This is a data set of 295 breast cancer patients followed up to 18 years. Microarray data (Rosetta technology) are available for these patients. We used the same quality preselection of 4919 genes as the original authors. Survival time and censoring indicators are available for each patient. We are interested in finding GO terms associated with survival of patients. GO information was obtained using the AnnBuilder package (Zhang et al., 2003). Of all GO terms, 2865 from the biological process ontology corresponded to non-empty gene sets in this data set.
The focus level for each of these three graphs was calculated in an automated way. We first found the ‘feasible terms’ as all GO terms whose offspring could all be written in terms of unions of no more than 9 atom sets. Then we defined the focus level as all feasible terms which had no parents among the feasible terms. This gave us a focus level of maximum generality given a reasonable computation time. For the biological process graph the focus level calculated in this way consisted of 589 terms; 262 terms were above the focus level and 2002 below. For calculating raw P-values we used the global test for survival data of Goeman et al. (2005).
Figure 2 compares the multiplicity-adjusted P-values from the focus level method to the raw P-values. The figure clearly shows that the focus level multiplicity adjustment does not preserve the ordering of the raw P-values. The dominant diagonal line around y = x + log10(589) gives the GO terms at or close to the focus level, while the spread of points around the line is of GO terms far from the focus level. The figure also shows examples of multiplicity-adjusted P-values which are smaller than the corresponding raw P-values.
Base 10 logarithm of raw P-values for the biological process GO graph versus base 10 logarithm of multiplicity-adjusted P-values from the focus level procedure. The dotted line is the line of equality.
Base 10 logarithm of raw P-values for the biological process GO graph versus base 10 logarithm of multiplicity-adjusted P-values from the focus level procedure. The dotted line is the line of equality.
Next we compared the number of gene sets found with the focus level method to Holm's procedure, which also controls the family-wise error rate without any assumptions on the joint distribution of the test statistics. Table 3 gives the number of significant GO terms for Holm's (1979) procedure, and for Holm's procedure incorporating the logical relationships in the graph (‘Holm plus’). The latter procedure tests all nodes as Holm's procedure, but rejects not only the term itself but also all its ancestor terms. The table also gives the number of rejected hypothesis for the focus level procedure for different choices of the focus level. The focus levels are defined in relation to the maximum number of atoms below each focus level term, as explained above. Table 3 shows that the focus level procedure rejects more gene sets than Holm's procedure. In this data set it seems preferable to choose a very general focus level: the number of gene sets found when choosing a high focus level is greater than the number found by the extended Holm procedure, while much fewer gene sets are found when choosing a low focus level. Table 4 gives the overlap between the GO terms found by the focus level procedure at different choices of the focus level at α = 0.01. The numbers suggest good agreement between the GO graphs found even at widely different choices of the focus level.
Number of rejected biological process gene ontology terms in the Netherlands Cancer Institute breast cancer data set for Holm's procedure and the focus level procedure at different choices of the focus level
| Procedure . | Number of focus level . | α-level . | |||||
|---|---|---|---|---|---|---|---|
| . | . | 0.1 . | 0.01 . | 0.001 . | 1e−04 . | 1e−05 . | 1e−06 . |
| Holm | 2865 | 439 | 247 | 98 | 38 | 9 | 0 |
| Holm plus | 2865 | 475 | 312 | 161 | 86 | 44 | 0 |
| Bottom up | 840 | 423 | 252 | 135 | 80 | 25 | 0 |
| Focus level 1 | 960 | 466 | 283 | 142 | 73 | 36 | 0 |
| Focus level 2 | 918 | 506 | 305 | 162 | 80 | 37 | 0 |
| Focus level 3 | 851 | 508 | 313 | 164 | 89 | 51 | 26 |
| Focus level 4 | 770 | 513 | 307 | 166 | 89 | 51 | 26 |
| Focus level 5 | 711 | 518 | 317 | 169 | 85 | 53 | 32 |
| Focus level 6 | 668 | 525 | 320 | 172 | 90 | 50 | 26 |
| Focus level 7 | 641 | 536 | 318 | 165 | 87 | 40 | 26 |
| Focus level 8 | 607 | 535 | 333 | 191 | 97 | 49 | 26 |
| Focus level 9 | 589 | 539 | 336 | 197 | 99 | 52 | 26 |
| Focus level 10 | 572 | 543 | 334 | 196 | 98 | 59 | 25 |
| Focus level 11 | 559 | 544 | 342 | 193 | 99 | 63 | 26 |
| Focus level 12 | 534 | 545 | 342 | 191 | 99 | 68 | 26 |
| Procedure . | Number of focus level . | α-level . | |||||
|---|---|---|---|---|---|---|---|
| . | . | 0.1 . | 0.01 . | 0.001 . | 1e−04 . | 1e−05 . | 1e−06 . |
| Holm | 2865 | 439 | 247 | 98 | 38 | 9 | 0 |
| Holm plus | 2865 | 475 | 312 | 161 | 86 | 44 | 0 |
| Bottom up | 840 | 423 | 252 | 135 | 80 | 25 | 0 |
| Focus level 1 | 960 | 466 | 283 | 142 | 73 | 36 | 0 |
| Focus level 2 | 918 | 506 | 305 | 162 | 80 | 37 | 0 |
| Focus level 3 | 851 | 508 | 313 | 164 | 89 | 51 | 26 |
| Focus level 4 | 770 | 513 | 307 | 166 | 89 | 51 | 26 |
| Focus level 5 | 711 | 518 | 317 | 169 | 85 | 53 | 32 |
| Focus level 6 | 668 | 525 | 320 | 172 | 90 | 50 | 26 |
| Focus level 7 | 641 | 536 | 318 | 165 | 87 | 40 | 26 |
| Focus level 8 | 607 | 535 | 333 | 191 | 97 | 49 | 26 |
| Focus level 9 | 589 | 539 | 336 | 197 | 99 | 52 | 26 |
| Focus level 10 | 572 | 543 | 334 | 196 | 98 | 59 | 25 |
| Focus level 11 | 559 | 544 | 342 | 193 | 99 | 63 | 26 |
| Focus level 12 | 534 | 545 | 342 | 191 | 99 | 68 | 26 |
Number of rejected biological process gene ontology terms in the Netherlands Cancer Institute breast cancer data set for Holm's procedure and the focus level procedure at different choices of the focus level
| Procedure . | Number of focus level . | α-level . | |||||
|---|---|---|---|---|---|---|---|
| . | . | 0.1 . | 0.01 . | 0.001 . | 1e−04 . | 1e−05 . | 1e−06 . |
| Holm | 2865 | 439 | 247 | 98 | 38 | 9 | 0 |
| Holm plus | 2865 | 475 | 312 | 161 | 86 | 44 | 0 |
| Bottom up | 840 | 423 | 252 | 135 | 80 | 25 | 0 |
| Focus level 1 | 960 | 466 | 283 | 142 | 73 | 36 | 0 |
| Focus level 2 | 918 | 506 | 305 | 162 | 80 | 37 | 0 |
| Focus level 3 | 851 | 508 | 313 | 164 | 89 | 51 | 26 |
| Focus level 4 | 770 | 513 | 307 | 166 | 89 | 51 | 26 |
| Focus level 5 | 711 | 518 | 317 | 169 | 85 | 53 | 32 |
| Focus level 6 | 668 | 525 | 320 | 172 | 90 | 50 | 26 |
| Focus level 7 | 641 | 536 | 318 | 165 | 87 | 40 | 26 |
| Focus level 8 | 607 | 535 | 333 | 191 | 97 | 49 | 26 |
| Focus level 9 | 589 | 539 | 336 | 197 | 99 | 52 | 26 |
| Focus level 10 | 572 | 543 | 334 | 196 | 98 | 59 | 25 |
| Focus level 11 | 559 | 544 | 342 | 193 | 99 | 63 | 26 |
| Focus level 12 | 534 | 545 | 342 | 191 | 99 | 68 | 26 |
| Procedure . | Number of focus level . | α-level . | |||||
|---|---|---|---|---|---|---|---|
| . | . | 0.1 . | 0.01 . | 0.001 . | 1e−04 . | 1e−05 . | 1e−06 . |
| Holm | 2865 | 439 | 247 | 98 | 38 | 9 | 0 |
| Holm plus | 2865 | 475 | 312 | 161 | 86 | 44 | 0 |
| Bottom up | 840 | 423 | 252 | 135 | 80 | 25 | 0 |
| Focus level 1 | 960 | 466 | 283 | 142 | 73 | 36 | 0 |
| Focus level 2 | 918 | 506 | 305 | 162 | 80 | 37 | 0 |
| Focus level 3 | 851 | 508 | 313 | 164 | 89 | 51 | 26 |
| Focus level 4 | 770 | 513 | 307 | 166 | 89 | 51 | 26 |
| Focus level 5 | 711 | 518 | 317 | 169 | 85 | 53 | 32 |
| Focus level 6 | 668 | 525 | 320 | 172 | 90 | 50 | 26 |
| Focus level 7 | 641 | 536 | 318 | 165 | 87 | 40 | 26 |
| Focus level 8 | 607 | 535 | 333 | 191 | 97 | 49 | 26 |
| Focus level 9 | 589 | 539 | 336 | 197 | 99 | 52 | 26 |
| Focus level 10 | 572 | 543 | 334 | 196 | 98 | 59 | 25 |
| Focus level 11 | 559 | 544 | 342 | 193 | 99 | 63 | 26 |
| Focus level 12 | 534 | 545 | 342 | 191 | 99 | 68 | 26 |
Overlap between the significant GO graphs found by the focus level procedure at different heights of the focus level at α = 0.01 (numbers of GO terms)
| . | 1 . | 2 . | 3 . | 4 . | 5 . | 6 . | 7 . | 8 . | 9 . | 10 . | 11 . | 12 . |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 283 | |||||||||||
| 2 | 268 | 305 | ||||||||||
| 3 | 252 | 287 | 313 | |||||||||
| 4 | 246 | 281 | 304 | 307 | ||||||||
| 5 | 250 | 275 | 298 | 300 | 317 | |||||||
| 6 | 250 | 275 | 297 | 299 | 316 | 320 | ||||||
| 7 | 250 | 275 | 294 | 295 | 311 | 315 | 318 | |||||
| 8 | 254 | 278 | 300 | 297 | 312 | 316 | 315 | 333 | ||||
| 9 | 253 | 277 | 300 | 295 | 310 | 313 | 312 | 330 | 336 | |||
| 10 | 252 | 276 | 299 | 294 | 309 | 312 | 311 | 329 | 334 | 334 | ||
| 11 | 253 | 277 | 299 | 294 | 309 | 312 | 311 | 329 | 334 | 333 | 342 | |
| 12 | 242 | 265 | 287 | 282 | 297 | 300 | 299 | 317 | 322 | 321 | 330 | 342 |
| . | 1 . | 2 . | 3 . | 4 . | 5 . | 6 . | 7 . | 8 . | 9 . | 10 . | 11 . | 12 . |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 283 | |||||||||||
| 2 | 268 | 305 | ||||||||||
| 3 | 252 | 287 | 313 | |||||||||
| 4 | 246 | 281 | 304 | 307 | ||||||||
| 5 | 250 | 275 | 298 | 300 | 317 | |||||||
| 6 | 250 | 275 | 297 | 299 | 316 | 320 | ||||||
| 7 | 250 | 275 | 294 | 295 | 311 | 315 | 318 | |||||
| 8 | 254 | 278 | 300 | 297 | 312 | 316 | 315 | 333 | ||||
| 9 | 253 | 277 | 300 | 295 | 310 | 313 | 312 | 330 | 336 | |||
| 10 | 252 | 276 | 299 | 294 | 309 | 312 | 311 | 329 | 334 | 334 | ||
| 11 | 253 | 277 | 299 | 294 | 309 | 312 | 311 | 329 | 334 | 333 | 342 | |
| 12 | 242 | 265 | 287 | 282 | 297 | 300 | 299 | 317 | 322 | 321 | 330 | 342 |
Focus level height × Focus level height.
Overlap between the significant GO graphs found by the focus level procedure at different heights of the focus level at α = 0.01 (numbers of GO terms)
| . | 1 . | 2 . | 3 . | 4 . | 5 . | 6 . | 7 . | 8 . | 9 . | 10 . | 11 . | 12 . |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 283 | |||||||||||
| 2 | 268 | 305 | ||||||||||
| 3 | 252 | 287 | 313 | |||||||||
| 4 | 246 | 281 | 304 | 307 | ||||||||
| 5 | 250 | 275 | 298 | 300 | 317 | |||||||
| 6 | 250 | 275 | 297 | 299 | 316 | 320 | ||||||
| 7 | 250 | 275 | 294 | 295 | 311 | 315 | 318 | |||||
| 8 | 254 | 278 | 300 | 297 | 312 | 316 | 315 | 333 | ||||
| 9 | 253 | 277 | 300 | 295 | 310 | 313 | 312 | 330 | 336 | |||
| 10 | 252 | 276 | 299 | 294 | 309 | 312 | 311 | 329 | 334 | 334 | ||
| 11 | 253 | 277 | 299 | 294 | 309 | 312 | 311 | 329 | 334 | 333 | 342 | |
| 12 | 242 | 265 | 287 | 282 | 297 | 300 | 299 | 317 | 322 | 321 | 330 | 342 |
| . | 1 . | 2 . | 3 . | 4 . | 5 . | 6 . | 7 . | 8 . | 9 . | 10 . | 11 . | 12 . |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 283 | |||||||||||
| 2 | 268 | 305 | ||||||||||
| 3 | 252 | 287 | 313 | |||||||||
| 4 | 246 | 281 | 304 | 307 | ||||||||
| 5 | 250 | 275 | 298 | 300 | 317 | |||||||
| 6 | 250 | 275 | 297 | 299 | 316 | 320 | ||||||
| 7 | 250 | 275 | 294 | 295 | 311 | 315 | 318 | |||||
| 8 | 254 | 278 | 300 | 297 | 312 | 316 | 315 | 333 | ||||
| 9 | 253 | 277 | 300 | 295 | 310 | 313 | 312 | 330 | 336 | |||
| 10 | 252 | 276 | 299 | 294 | 309 | 312 | 311 | 329 | 334 | 334 | ||
| 11 | 253 | 277 | 299 | 294 | 309 | 312 | 311 | 329 | 334 | 333 | 342 | |
| 12 | 242 | 265 | 287 | 282 | 297 | 300 | 299 | 317 | 322 | 321 | 330 | 342 |
Focus level height × Focus level height.
Visualization of the significant GO graph obtained by the focus level method can be done using the R package Rgraphviz (Gentry, 2005). Figure 3 shows the significant subgraph of the biological process graph at α = 10−4. This graph gives a succinct picture of the processes most clearly associated with survival of breast cancer patients. It has prominent clusters of processes involved in cell cycle and related processes such as DNA replication.
The significant subgraph of the biological process GO graph according to the focus level procedure at α = 0.0001. See Table 5 for a legend to this figure. The nodes colored grey are focus level nodes.
The significant subgraph of the biological process GO graph according to the focus level procedure at α = 0.0001. See Table 5 for a legend to this figure. The nodes colored grey are focus level nodes.
Legend of Figure 3
| 1 | biological process | 53 | DNA replication |
| 2 | response to stimulus | 54 | DNA recombination |
| 3 | metabolic process | 55 | protein modification |
| 4 | cellular process | 56 | dephosphorylation |
| 5 | biological regulation | 57 | second-messenger-mediated |
| 6 | developmental process | signaling | |
| 7 | response to stress | 58 | M phase |
| 8 | response to endogenous | 59 | interphase |
| stimulus | 60 | mitotic cell cycle | |
| 9 | primary metabolic process | 61 | regulation of progression |
| 10 | macromolecule metabolic | through cell cycle | |
| process | 62 | microtubule-based process | |
| 11 | cellular metabolic process | 63 | chromosome organization and |
| 12 | cell communication | biogenesis (sensu Eukaryota) | |
| 13 | cell division | 64 | chromosome localization |
| 14 | cell cycle | 65 | establishment of organelle |
| 15 | chromosome segregation | localization | |
| 16 | regulation of biological process | 66 | cell death |
| 17 | cell organization and | 67 | DNA-dependent DNA |
| biogenesis | replication | ||
| 18 | localization | 68 | protein amino acid |
| 19 | cellular developmental process | dephosphorylation | |
| 20 | death | 69 | phosphoinositide-mediated |
| 21 | response to DNA damage | signaling | |
| stimulus | 70 | M phase of meiotic cell cycle | |
| 22 | protein metabolic process | 71 | M phase of meiotic cell cycle |
| 23 | nucleobase, nucleoside, | 72 | interphase of mitotic cell cycle |
| nucleotide and nucleic acid | 73 | regulation of progression through | |
| metabolic process | mitotic cell cycle | ||
| 24 | biopolymer metabolic process | 74 | cell cycle checkpoint |
| 25 | cellular macromolecule | 75 | positive regulation of |
| metabolic process | progression through cell cycle | ||
| 26 | phosphorus metabolic process | 76 | microtubule cytoskeleton |
| 27 | signal transduction | organization and biogenesis | |
| 28 | cytokinesis | 77 | sister chromatid segregation |
| 29 | cell cycle process | 78 | establishment of chromosome |
| 30 | regulation of cellular process | localization | |
| 31 | positive regulation of biological | 79 | programmed cell death |
| process | 80 | regulation of a molecular function | |
| 32 | organelle organization and | 81 | mitosis |
| biogenesis | 82 | meiosis | |
| 33 | cellular localization | 83 | G2/M transition of mitotic cell |
| 34 | establishment of localization | cycle | |
| 35 | protein localization | 84 | positive regulation of progression |
| 36 | cell differentiation | through mitotic cell cycle | |
| 37 | DNA metabolic process | 85 | spindle checkpoint |
| 38 | cellular protein metabolic | 86 | spindle organization and |
| process | biogenesis | ||
| 39 | biopolymer modification | 87 | regulation of programmed cell |
| 40 | phosphate metabolic process | death | |
| 41 | intracellular signaling cascade | 88 | apoptosis |
| 42 | meiotic cell cycle | 89 | regulation of catalytic activity |
| 43 | cell cycle phase | 90 | mitotic sister chromatid |
| 44 | regulation of cell cycle | segregation | |
| 45 | positive regulation of cellular | 91 | regulation of mitosis |
| process | 92 | chromosome condensation | |
| 46 | cytoskeleton organization and | 93 | regulation of apoptosis |
| biogenesis | 94 | regulation of hydrolase activity | |
| 47 | chromosome organization and | 95 | mitotic chromosome |
| biogenesis | condensation | ||
| 48 | organelle localization | 96 | regulation of exit from mitosis |
| 49 | establishment of cellular | 97 | regulation of caspase activity |
| localization | 98 | positive regulation of exit from | |
| 50 | protein complex localization | mitosis | |
| 51 | cell development | 99 | negative regulation of caspase |
| 52 | DNA repair | activity |
| 1 | biological process | 53 | DNA replication |
| 2 | response to stimulus | 54 | DNA recombination |
| 3 | metabolic process | 55 | protein modification |
| 4 | cellular process | 56 | dephosphorylation |
| 5 | biological regulation | 57 | second-messenger-mediated |
| 6 | developmental process | signaling | |
| 7 | response to stress | 58 | M phase |
| 8 | response to endogenous | 59 | interphase |
| stimulus | 60 | mitotic cell cycle | |
| 9 | primary metabolic process | 61 | regulation of progression |
| 10 | macromolecule metabolic | through cell cycle | |
| process | 62 | microtubule-based process | |
| 11 | cellular metabolic process | 63 | chromosome organization and |
| 12 | cell communication | biogenesis (sensu Eukaryota) | |
| 13 | cell division | 64 | chromosome localization |
| 14 | cell cycle | 65 | establishment of organelle |
| 15 | chromosome segregation | localization | |
| 16 | regulation of biological process | 66 | cell death |
| 17 | cell organization and | 67 | DNA-dependent DNA |
| biogenesis | replication | ||
| 18 | localization | 68 | protein amino acid |
| 19 | cellular developmental process | dephosphorylation | |
| 20 | death | 69 | phosphoinositide-mediated |
| 21 | response to DNA damage | signaling | |
| stimulus | 70 | M phase of meiotic cell cycle | |
| 22 | protein metabolic process | 71 | M phase of meiotic cell cycle |
| 23 | nucleobase, nucleoside, | 72 | interphase of mitotic cell cycle |
| nucleotide and nucleic acid | 73 | regulation of progression through | |
| metabolic process | mitotic cell cycle | ||
| 24 | biopolymer metabolic process | 74 | cell cycle checkpoint |
| 25 | cellular macromolecule | 75 | positive regulation of |
| metabolic process | progression through cell cycle | ||
| 26 | phosphorus metabolic process | 76 | microtubule cytoskeleton |
| 27 | signal transduction | organization and biogenesis | |
| 28 | cytokinesis | 77 | sister chromatid segregation |
| 29 | cell cycle process | 78 | establishment of chromosome |
| 30 | regulation of cellular process | localization | |
| 31 | positive regulation of biological | 79 | programmed cell death |
| process | 80 | regulation of a molecular function | |
| 32 | organelle organization and | 81 | mitosis |
| biogenesis | 82 | meiosis | |
| 33 | cellular localization | 83 | G2/M transition of mitotic cell |
| 34 | establishment of localization | cycle | |
| 35 | protein localization | 84 | positive regulation of progression |
| 36 | cell differentiation | through mitotic cell cycle | |
| 37 | DNA metabolic process | 85 | spindle checkpoint |
| 38 | cellular protein metabolic | 86 | spindle organization and |
| process | biogenesis | ||
| 39 | biopolymer modification | 87 | regulation of programmed cell |
| 40 | phosphate metabolic process | death | |
| 41 | intracellular signaling cascade | 88 | apoptosis |
| 42 | meiotic cell cycle | 89 | regulation of catalytic activity |
| 43 | cell cycle phase | 90 | mitotic sister chromatid |
| 44 | regulation of cell cycle | segregation | |
| 45 | positive regulation of cellular | 91 | regulation of mitosis |
| process | 92 | chromosome condensation | |
| 46 | cytoskeleton organization and | 93 | regulation of apoptosis |
| biogenesis | 94 | regulation of hydrolase activity | |
| 47 | chromosome organization and | 95 | mitotic chromosome |
| biogenesis | condensation | ||
| 48 | organelle localization | 96 | regulation of exit from mitosis |
| 49 | establishment of cellular | 97 | regulation of caspase activity |
| localization | 98 | positive regulation of exit from | |
| 50 | protein complex localization | mitosis | |
| 51 | cell development | 99 | negative regulation of caspase |
| 52 | DNA repair | activity |
Legend of Figure 3
| 1 | biological process | 53 | DNA replication |
| 2 | response to stimulus | 54 | DNA recombination |
| 3 | metabolic process | 55 | protein modification |
| 4 | cellular process | 56 | dephosphorylation |
| 5 | biological regulation | 57 | second-messenger-mediated |
| 6 | developmental process | signaling | |
| 7 | response to stress | 58 | M phase |
| 8 | response to endogenous | 59 | interphase |
| stimulus | 60 | mitotic cell cycle | |
| 9 | primary metabolic process | 61 | regulation of progression |
| 10 | macromolecule metabolic | through cell cycle | |
| process | 62 | microtubule-based process | |
| 11 | cellular metabolic process | 63 | chromosome organization and |
| 12 | cell communication | biogenesis (sensu Eukaryota) | |
| 13 | cell division | 64 | chromosome localization |
| 14 | cell cycle | 65 | establishment of organelle |
| 15 | chromosome segregation | localization | |
| 16 | regulation of biological process | 66 | cell death |
| 17 | cell organization and | 67 | DNA-dependent DNA |
| biogenesis | replication | ||
| 18 | localization | 68 | protein amino acid |
| 19 | cellular developmental process | dephosphorylation | |
| 20 | death | 69 | phosphoinositide-mediated |
| 21 | response to DNA damage | signaling | |
| stimulus | 70 | M phase of meiotic cell cycle | |
| 22 | protein metabolic process | 71 | M phase of meiotic cell cycle |
| 23 | nucleobase, nucleoside, | 72 | interphase of mitotic cell cycle |
| nucleotide and nucleic acid | 73 | regulation of progression through | |
| metabolic process | mitotic cell cycle | ||
| 24 | biopolymer metabolic process | 74 | cell cycle checkpoint |
| 25 | cellular macromolecule | 75 | positive regulation of |
| metabolic process | progression through cell cycle | ||
| 26 | phosphorus metabolic process | 76 | microtubule cytoskeleton |
| 27 | signal transduction | organization and biogenesis | |
| 28 | cytokinesis | 77 | sister chromatid segregation |
| 29 | cell cycle process | 78 | establishment of chromosome |
| 30 | regulation of cellular process | localization | |
| 31 | positive regulation of biological | 79 | programmed cell death |
| process | 80 | regulation of a molecular function | |
| 32 | organelle organization and | 81 | mitosis |
| biogenesis | 82 | meiosis | |
| 33 | cellular localization | 83 | G2/M transition of mitotic cell |
| 34 | establishment of localization | cycle | |
| 35 | protein localization | 84 | positive regulation of progression |
| 36 | cell differentiation | through mitotic cell cycle | |
| 37 | DNA metabolic process | 85 | spindle checkpoint |
| 38 | cellular protein metabolic | 86 | spindle organization and |
| process | biogenesis | ||
| 39 | biopolymer modification | 87 | regulation of programmed cell |
| 40 | phosphate metabolic process | death | |
| 41 | intracellular signaling cascade | 88 | apoptosis |
| 42 | meiotic cell cycle | 89 | regulation of catalytic activity |
| 43 | cell cycle phase | 90 | mitotic sister chromatid |
| 44 | regulation of cell cycle | segregation | |
| 45 | positive regulation of cellular | 91 | regulation of mitosis |
| process | 92 | chromosome condensation | |
| 46 | cytoskeleton organization and | 93 | regulation of apoptosis |
| biogenesis | 94 | regulation of hydrolase activity | |
| 47 | chromosome organization and | 95 | mitotic chromosome |
| biogenesis | condensation | ||
| 48 | organelle localization | 96 | regulation of exit from mitosis |
| 49 | establishment of cellular | 97 | regulation of caspase activity |
| localization | 98 | positive regulation of exit from | |
| 50 | protein complex localization | mitosis | |
| 51 | cell development | 99 | negative regulation of caspase |
| 52 | DNA repair | activity |
| 1 | biological process | 53 | DNA replication |
| 2 | response to stimulus | 54 | DNA recombination |
| 3 | metabolic process | 55 | protein modification |
| 4 | cellular process | 56 | dephosphorylation |
| 5 | biological regulation | 57 | second-messenger-mediated |
| 6 | developmental process | signaling | |
| 7 | response to stress | 58 | M phase |
| 8 | response to endogenous | 59 | interphase |
| stimulus | 60 | mitotic cell cycle | |
| 9 | primary metabolic process | 61 | regulation of progression |
| 10 | macromolecule metabolic | through cell cycle | |
| process | 62 | microtubule-based process | |
| 11 | cellular metabolic process | 63 | chromosome organization and |
| 12 | cell communication | biogenesis (sensu Eukaryota) | |
| 13 | cell division | 64 | chromosome localization |
| 14 | cell cycle | 65 | establishment of organelle |
| 15 | chromosome segregation | localization | |
| 16 | regulation of biological process | 66 | cell death |
| 17 | cell organization and | 67 | DNA-dependent DNA |
| biogenesis | replication | ||
| 18 | localization | 68 | protein amino acid |
| 19 | cellular developmental process | dephosphorylation | |
| 20 | death | 69 | phosphoinositide-mediated |
| 21 | response to DNA damage | signaling | |
| stimulus | 70 | M phase of meiotic cell cycle | |
| 22 | protein metabolic process | 71 | M phase of meiotic cell cycle |
| 23 | nucleobase, nucleoside, | 72 | interphase of mitotic cell cycle |
| nucleotide and nucleic acid | 73 | regulation of progression through | |
| metabolic process | mitotic cell cycle | ||
| 24 | biopolymer metabolic process | 74 | cell cycle checkpoint |
| 25 | cellular macromolecule | 75 | positive regulation of |
| metabolic process | progression through cell cycle | ||
| 26 | phosphorus metabolic process | 76 | microtubule cytoskeleton |
| 27 | signal transduction | organization and biogenesis | |
| 28 | cytokinesis | 77 | sister chromatid segregation |
| 29 | cell cycle process | 78 | establishment of chromosome |
| 30 | regulation of cellular process | localization | |
| 31 | positive regulation of biological | 79 | programmed cell death |
| process | 80 | regulation of a molecular function | |
| 32 | organelle organization and | 81 | mitosis |
| biogenesis | 82 | meiosis | |
| 33 | cellular localization | 83 | G2/M transition of mitotic cell |
| 34 | establishment of localization | cycle | |
| 35 | protein localization | 84 | positive regulation of progression |
| 36 | cell differentiation | through mitotic cell cycle | |
| 37 | DNA metabolic process | 85 | spindle checkpoint |
| 38 | cellular protein metabolic | 86 | spindle organization and |
| process | biogenesis | ||
| 39 | biopolymer modification | 87 | regulation of programmed cell |
| 40 | phosphate metabolic process | death | |
| 41 | intracellular signaling cascade | 88 | apoptosis |
| 42 | meiotic cell cycle | 89 | regulation of catalytic activity |
| 43 | cell cycle phase | 90 | mitotic sister chromatid |
| 44 | regulation of cell cycle | segregation | |
| 45 | positive regulation of cellular | 91 | regulation of mitosis |
| process | 92 | chromosome condensation | |
| 46 | cytoskeleton organization and | 93 | regulation of apoptosis |
| biogenesis | 94 | regulation of hydrolase activity | |
| 47 | chromosome organization and | 95 | mitotic chromosome |
| biogenesis | condensation | ||
| 48 | organelle localization | 96 | regulation of exit from mitosis |
| 49 | establishment of cellular | 97 | regulation of caspase activity |
| localization | 98 | positive regulation of exit from | |
| 50 | protein complex localization | mitosis | |
| 51 | cell development | 99 | negative regulation of caspase |
| 52 | DNA repair | activity |
8 APPLICATION: AML
A second application using Global Ancova (Hummel et al., 2007; Mansmann and Meister, 2005) on an AML data set (Schnittger et al., 2005) is presented in the Supplemental information.
9 DISCUSSION
Two things are important when adjusting for multiple testing in the Gene Ontology graph. In the first place, the GO graph is structured, and it is wasteful not to make use of that structure. Secondly, due to the unknown and possibly highly irregular dependency structures between test statistics, it is important not to make any assumptions on the joint distribution of the test statistics. In this article we have presented the focus level method: a multiple testing method that controls the family-wise error rate on the GO graph that makes use of the graph structure and makes no further assumptions. Because the procedure makes use of the logical structure of the GO graph, the procedure has more power than traditional procedures such as Holm's. To our knowledge the focus level method is the first method that allows the structure of Gene Ontology to be used when correcting for multiple testing.
The significant gene sets that come out of the focus level methods can be presented as a subgraph of the GO graph. This makes it easy to interpret the results of the experiment, as the significant gene sets are always presented in context. The focus level method tends to produce coherent graphs because the algorithm favors finding significant gene sets close to other significant sets.
The focus level method requires the user to specify a focus level, which is the initial level of the graph in which the algorithm searches for significant sets; it only delves deeper into the graph at places where significance is found. The choice of a focus level gives the researcher some control over the multiple testing procedure, specifying at which level of detail in the GO graph it should have most power. This flexibility of the procedure is an important asset of the focus level method, as it allows for more biological input into the statistical analysis. There are some computational constraints that exclude some of the general nodes as focus level nodes, but in our experience these constraints are flexible enough to fit the needs of most researchers.
Because the focus level method preserves the structure of the GO graph when correcting for multiple testing, it does not preserve the ordering of the raw P-values. Therefore, adjusted P-values from the focus level method should never be interpreted individually, but only in the context of the significant GO graph, because the interpretation of an individual adjusted P-value should depend on the location in the graph where it occurs. The primary result of the focus level method is a significant GO subgraph, not a list of adjusted P-values. If the primary interest is in individual GO nodes instead of in the significant graph as a whole, other multiple testing correction methods may be more appropriate.
The focus level procedure controls the family-wise error rate rather than the false discovery rate (Benjamini and Hochberg, 1995). This is because the concept of the false discovery rate was designed for an unstructured collection of potential discoveries, in which discoveries are exchangeable and it makes sense to talk about a rate of false discoveries. This concept cannot be immediately transferred to a hierarchical graph-like structure with logical relationships. Control of the false discovery rate in the closed graph (Figure 1), for example, does not imply control of the false discovery rate in the original graph. The concept of the family-wise error rate, on the other hand, is easily applicable to graphs and much more consistent with the idea that the graph that results from the focus level procedure in fact constitutes a single ‘discovery’.
It is a frequently voiced general criticism of GO that it does not represent an objective view of biology. The genes annotated are mostly the result of disease related research and the construction of GO is strongly biased in this direction. However, this criticism does not reduce the need for statistical procedures on acyclic graphs derived from ontologies, because ontologies will remain an essential tool for organizing complex knowledge in genomic research. The methods presented in this article for GO can easily be applied to more reliable ontologies that will be available in the future.
ACKNOWLEDGEMENTS
The authors thank Prof. Laura van 't Veer (Netherlands Cancer Institute) and Prof. Stephan Bohlander (Internal Medicine III, University Hospital, University of Munich) for kindly providing their data. The research of J.J.G. was supported by the Netherlands Bioinformatics Centre, Biorange project 1.3.2. The research of UM was supported by the German National Genome Research Network, project 01 GR 0459.
Conflict of Interest: none declared.
REFERENCES
Author notes
Associate Editor: David Rocke










