A predicted interactome for Arabidopsis.

The complex cellular functions of an organism frequently rely on physical interactions between proteins. A map of all protein-protein interactions, an interactome, is thus an invaluable tool. We present an interactome for Arabidopsis (Arabidopsis thaliana) predicted from interacting orthologs in yeast (Saccharomyces cerevisiae), nematode worm (Caenorhabditis elegans), fruitfly (Drosophila melanogaster), and human (Homo sapiens). As an internal quality control, a confidence value was generated based on the amount of supporting evidence for each interaction. A total of 1,159 high confidence, 5,913 medium confidence, and 12,907 low confidence interactions were identified for 3,617 conserved Arabidopsis proteins. There was significant coexpression of genes whose proteins were predicted to interact, even among low confidence interactions. Interacting proteins were also significantly more likely to be found within the same subcellular location, and significantly less likely to be found in conflicting localizations, than randomly paired proteins. A notable exception was that proteins located in the Golgi were more likely to interact with Golgi, vacuolar, or endoplasmic reticulum sorted proteins, indicating possible docking or trafficking interactions. These predictions can aid researchers by extending known complexes and pathways with candidate proteins. In addition, we have predicted interactions for many previously unknown proteins in known pathways and complexes. We present this interactome, and an online Web interface, the Arabidopsis Interactions Viewer, as a first step toward understanding global signaling in Arabidopsis, and to whet the appetite of those who are awaiting results from high-throughput experimental approaches.

High-throughput experiments have resolved genome scale networks of protein-protein interactions (PPIs; interactomes) in yeast (Saccharomyces cerevisiae), fruitfly (Drosophila melanogaster), nematode worm (Caenorhabditis elegans), and human (Homo sapiens; Uetz et al., 2000; Giot et al., 2003; Li et al., 2004; Miller et al., 2005; Rual et al., 2005; Gandhi et al., 2006). These interactomes have revealed protein transactions in biological processes and relatedness of interacting partners. Interactomics is quickly becoming a valuable new area of systems biology by comprehensively deducing the networks of PPIs that form the basis for much of signaling and regulatory control as well as the machinery of cellular function.
Where the cost of a high-throughput experimental approach is prohibitive, a computational alternative is often a useful preliminary step, especially when combined with literature extraction of all published protein interactions. The Online Predicted Human Interaction Database (Brown and Jurisica, 2005) combines extensive literature search from the Human Protein Resource Database (http://www.hprd.org) and predictions of interacting orthologs (interologs) derived from yeast and fruitfly (Krogan et al., 2006). Predicted interactomes are deduced from experimental interactomes of other species. A pair of interologs in the reference species predicts an interaction in the test species. This method relies on accurately predicting orthologous genes using similarity cutoffs and prediction algorithms such as INPARANOID (http://inparanoid.cgb.ki.se), and not simply best BLAST score (O'Brien et al., 2005). Because of this limitation, an interactome predicted from interologs will show interactions among the most conserved proteins. Fortunately, many pathways that are actively being studied, such as endomembrane trafficking and small GTPase signaling, show significant conservation among eukaryotes (Carter et al., 2004; Chang and Philips, 2006).
In a similar approach, the gene-coexpression network can be built by examining coexpression of genes across a wide number of tissues and experiments (Hanisch et al., 2002). This provides useful information about genes likely to be involved in the same biological processes in humans, mouse, Escherichia coli, and yeast (Bhardwaj and Lu, 2005).
Although plant protein interaction networks based on literature mining and coexpression of neighboring Arabidopsis (Arabidopsis thaliana) genes have been recently reviewed, there is currently no publicly available large-scale plant interactome (Uhrig, 2006; Williams and Bowles, 2006). In this article, we present a predicted Arabidopsis protein interactome based on the interolog method. We have shown that these predicted interacting proteins are significantly colocalized and coexpressed by analyzing existing experimental data from Arabidopsis. We have recapitulated many known signaling pathways and protein complexes in Arabidopsis and have extended them by adding new and often unknown proteins into existing networks. In this way we provide an avenue to expand the current understanding of signaling and cellular function by enabling hypothesis generation based on our predicted Arabidopsis interactome.

Building a Predicted Interactome
PPIs, which are the basis of intracellular signaling and regulation, were predicted on the assumption that evolutionarily conserved proteins would tend to have conserved interactions. The process began by using the ortholog-predicting algorithm INPARANOID (Remm et al., 2001; O'Brien et al., 2005) and automatic annotation from ENSEMBL (Birney et al., 2004; Curwen et al., 2004) to identify Arabidopsis proteins orthologous to proteins in yeast, nematode worm, fruitfly, and human. Partial interactomes from these four species have been experimentally determined (Giot et al., 2003; Li et al., 2004; Rual et al., 2005; Gandhi et al., 2006). A predicted interaction was established for Arabidopsis where orthologs existed for both interacting proteins in one of these four established interactomes. This process, outlined in Figure 1, is known as interaction ortholog (or interolog) mapping and is an established method of predicting interactomes (Lehner and Fraser, 2004; Yu et al., 2004). Interologs are thus predictions without direct experimental verification, but nonetheless a good place to begin. Using this method we identified 19,979 predicted interactions for 3,617 Arabidopsis proteins (Supplemental Table S1). Of these, 611 are predicted self interactions (homodimers) and 19,368 are interactions between different proteins (heterodimers).
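The mapping step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the pipeline used in this work: the function name `map_interologs`, the dictionary layout, and the toy identifiers are all hypothetical, and real ortholog tables from INPARANOID or ENSEMBL would need identifier cross-referencing first.

```python
from itertools import product


def map_interologs(reference_interactions, orthologs):
    """Transfer reference-species interactions to Arabidopsis via orthologs.

    reference_interactions: iterable of (protein_a, protein_b) pairs.
    orthologs: dict mapping a reference protein ID to a set of Arabidopsis IDs.
    Returns a set of unordered Arabidopsis pairs stored as sorted tuples.
    """
    predicted = set()
    for a, b in reference_interactions:
        # An interaction is transferred only if BOTH partners have orthologs;
        # product() yields nothing when either side is empty.
        for at_a, at_b in product(orthologs.get(a, ()), orthologs.get(b, ())):
            predicted.add(tuple(sorted((at_a, at_b))))
    return predicted


# Hypothetical toy data (identifiers are made up for illustration).
yeast_interactions = [("YFG1", "YFG2"), ("YFG2", "YFG3"), ("YFG4", "YFG4")]
yeast_to_arabidopsis = {
    "YFG1": {"AT1G00001"},
    "YFG2": {"AT2G00002"},
    "YFG4": {"AT4G00004"},
}
interologs = map_interologs(yeast_interactions, yeast_to_arabidopsis)
```

In this toy case the YFG2-YFG3 interaction is dropped because YFG3 has no Arabidopsis ortholog, while the YFG4 self interaction maps to a predicted homodimer.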
As an internal quality control we established a confidence value (CV; Supplemental Table S1). Our CV is established individually for each pair of interacting proteins and is based on the product of: (1) the number of different datasets in which the interaction was predicted; (2) the number of different kinds of experiments that supported this interaction; and (3) the number of species (out of four) in which this interaction was found. With this assessment, we have identified 1,159 high confidence interactions (CV > 10), 5,913 medium confidence interactions (CV between 2 and 10), and 12,907 low confidence interactions (CV = 1).
Predicted Arabidopsis interacting protein pairs (from Supplemental Table S1) were loaded into the network building programs OSPREY and CYTOSCAPE (see "Materials and Methods") to visualize interaction pathways. Surprisingly, 3,482 of the set of 3,617 conserved proteins were connected into a single interconnected network (Fig. 2A). Many proteins have a high number of interacting partners, including, perhaps predictably, ubiquitin-related proteins and members of the 26S proteasome, but also a Ras-related GTPase (At2g2290) and CDC2A, members of known signaling pathways (Table I). To analyze the topology of the network, proteins were divided into free ends (with only one interaction), pipes (two interactions), and hubs of different size (demonstrated in Fig. 2C). The distribution of hub sizes was logarithmic (Supplemental Table S1); however, when broken down by class, the largest class of proteins was medium hubs (Fig. 2D). Interacting proteins had an average of 11 interacting partners, which is smaller than is found in yeast (average 22 interacting partners), but comparable to Drosophila (average nine interacting partners).
Figure 1. Flowchart for the predicted Arabidopsis interactome. A list of Arabidopsis orthologs was identified using INPARANOID and ENSEMBL algorithms (see "Materials and Methods") from genome databases of yeast, nematode, fruitfly, and human. Where orthologs were found for both partners of a known protein interaction in the reference species, that interaction was mapped to (i.e. replaced with) corresponding Arabidopsis genes. This generated the Arabidopsis predicted interactome and a CV based on the amount of supporting evidence. Subsequent verification and analysis examined each interacting protein pair using Pearson correlation of gene expression profiles in an Arabidopsis transcriptome database (AtGenExpress) and checked for colocalization using SUBA. [See online article for color version of this figure.]
As we are only looking at evolutionarily conserved interactions and not any plant-specific interactions, pipes (two interacting partners) and free ends (single interacting partner) could easily be underrepresented (see "Materials and Methods"). When super and major hubs (>50 interactions; 116 proteins total) were removed from the interactome and the network was reconstructed, 3,230 (92%) of the remaining proteins still held together in a single network, with dozens of disconnected subnets of two to five proteins (data not shown). While this is similar to the observation in yeast that the network integrity is held together by smaller hubs, called the stratus structure (Batada et al., 2006), this should perhaps not be surprising as many of these interologs are based on yeast. Only 292 interactions have been found in two interactome datasets, usually yeast and Drosophila. The experimentally derived nematode worm and human interactome data are relatively incomplete, thus accounting for poor overlap with other datasets. At a minimum, the proteins that generated the stratus structure in yeast have orthologs in Arabidopsis, and thus our prediction is for a stratus structure.
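The topology analysis above (classifying proteins by number of partners, then removing the largest hubs and checking whether the network still holds together) can be illustrated with a small standard-library sketch. The function names, the class labels, and the toy network are assumptions made for illustration; only the cutoff of more than 50 interactions for super and major hubs comes from the text, and the toy example lowers it so that a five-node network has a hub to remove.

```python
from collections import defaultdict, deque


def build_adjacency(pairs):
    """Build an undirected adjacency map from interacting pairs."""
    adj = defaultdict(set)
    for a, b in pairs:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def degree_class(n_partners, hub_cutoff=50):
    """Label a protein by its number of interacting partners."""
    if n_partners == 1:
        return "free end"
    if n_partners == 2:
        return "pipe"
    if n_partners > hub_cutoff:
        return "super/major hub"
    return "smaller hub"


def largest_component_size(adj, removed=frozenset()):
    """Size of the largest connected component, ignoring removed nodes."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:  # breadth-first search over one component
            node = queue.popleft()
            size += 1
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        best = max(best, size)
    return best


# Hypothetical toy network: H is a hub joining A, B, and the C-D branch.
toy = build_adjacency([("H", "A"), ("H", "B"), ("H", "C"), ("C", "D")])
# The cutoff is lowered to 2 only so that the toy hub qualifies for removal.
hubs = {p for p, nbrs in toy.items() if degree_class(len(nbrs), hub_cutoff=2) == "super/major hub"}
intact = largest_component_size(toy)
without_hubs = largest_component_size(toy, removed=hubs)
```

Here removing the single hub breaks the five-protein network into fragments of at most two proteins, the opposite of what was observed for the predicted interactome, where 92% of the remaining proteins stayed connected.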

Extending Known Pathways and Complexes
Many of the Arabidopsis interologs were predicted from multiple species and interacting proteins fall into known complexes such as DNA repair and RNA splicing (Table II). Notably, some of the most evolutionarily conserved interactions included proteins with no previously known function, such as At5g27740, whose orthologs in human, yeast, and fruitfly interact with an AAA-type (ATPase associated with a variety of cellular activities) ATPase, also found in Arabidopsis (Table II, line 9). Thus it may be possible to extend known pathways or identify unknown members of protein complexes in Arabidopsis and assign putative function on the basis of its interacting partner's function. This type of annotation (i.e. predicted to interact with X) would extend functional annotation of the Arabidopsis genome.
A small interactome for Arabidopsis built by extensively mining the literature is available in the BIND database (Bader et al., 2001) and contains some 356 proteins and 711 interactions. Only 95 of the BIND proteins are found among the orthologs used in our predicted interactome. These 95 proteins had 85 interactions in BIND, of which 30 are also predicted by our method, which is significant overlap compared to an expected value of 1.3 (see "Materials and Methods"). Known interactions between OSMOTIC SENSITIVE1, syntaxin, and v-SNARE proteins formed a small network of vacuolar and Golgi localized proteins (Fig. 3, blue edges). When this network was extended based on interolog prediction, 20 new proteins were putatively added to this network, including many other Golgi and vacuolar proteins, more syntaxins, SNAREs, and SNAP (soluble NSF attachment protein) proteins, but also an ATPase, heat shock protein-83, protein phosphatases, and the RAS-related protein ARA5 (Fig. 3, red edges). When extended with interologs, known pathways for RAS and RHO-like GTPases (Supplemental Fig. S1) gained several not so surprising members (e.g. ROP and other RHO-GAPs). A few interesting members were also identified. One example is an unknown NCK1-like SH3 domain protein; the SH3 domain is a key protein interaction and signaling domain in humans (Wu et al., 2007) and is associated with vesicle trafficking in Arabidopsis (Lam et al., 2001). A PH (pleckstrin homology) domain protein was also identified that may have a role in vesicle trafficking or lipid signaling (Lee et al., 2002; Tang et al., 2005). Similarly, interactions for the KNAT/STM/BELL homeotic transcription factors and the RNA splicing machinery were extended using interolog prediction (Supplemental Figs. S2 and S3).
These new protein interactions are only predictions and demonstrate that interologs can be of some immediate use in generating a list of candidate genes when trying to reassemble protein complexes and signaling pathways for experimental verification.

Subcellular Localization of Interologs
To interact, proteins should in general reside in the same subcellular location, although some proteins will interact across adjacent subcellular locations (e.g. cytosol-membrane associated) and some will migrate between compartments and could have interaction partners in both locations (e.g. nucleus-cytosol). Proteins in the predicted Arabidopsis interactome were assigned to a subcellular location using data from The Arabidopsis Subcellular Database (SUBA; Heazlewood et al., 2005, 2007). We then found those interologs for which both interacting proteins possessed data from SUBA on subcellular localization and were not self-interacting proteins. Subcellular localization data were available for 2,623 interologs, corresponding to 918 unique proteins. Figure 4 shows the numbers of interologs as a function of the subcellular localization of their interacting proteins. P values associated with the deviation of these counts from a random interactome network with the same properties (see "Materials and Methods") are illustrated in Figure 4. There is a statistically significant enrichment of interologs for which both proteins of the pair are in the same compartment for all compartments except the extracellular space, for which there is very little data. These results indicate that, as expected, pairs of proteins predicted to interact tend to reside in the same location. The only other pairs of locations in Figure 4 with an enrichment of interologs are the Golgi apparatus/endoplasmic reticulum (ER) and Golgi apparatus/vacuole. In both of these cases the enrichment can be partially explained by the experimental difficulties in distinguishing proteins in these compartments (Dunkley et al., 2006), which will be reflected in the data in SUBA used for the analysis.
Some of the Golgi/ER and Golgi/vacuole protein interaction enrichment can also be attributed to real interactions between members of complexes involved in the endomembrane trafficking pathway (e.g. in Fig. 3). Gandhi et al. (2006) also note a strong enrichment of interacting proteins between these organelles in their recent study of the human interactome. In contrast to the enrichment found, there is a significant depletion of interologs for which one protein is nuclear and the other from the cytoplasm, ER, Golgi apparatus, mitochondria, peroxisome, or vacuole. As interactions between proteins in these compartments are unlikely, these results also confirm expectations.
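The tallying step behind such a compartment-by-compartment comparison can be sketched as follows. This is a simplified illustration under stated assumptions: each protein is given a single compartment (SUBA may list several), the function name and toy identifiers are hypothetical, and the comparison against a random network that yields the P values is not shown.

```python
from collections import Counter


def count_compartment_pairs(interologs, localization):
    """Count interologs per unordered pair of compartments.

    interologs: iterable of (protein_a, protein_b) pairs.
    localization: dict mapping a protein ID to one compartment name
                  (an assumption; a protein may have several in practice).
    Self interactions and pairs lacking localization data are skipped.
    """
    counts = Counter()
    for a, b in interologs:
        if a == b:
            continue  # self-interacting proteins are excluded
        if a not in localization or b not in localization:
            continue  # both partners need localization data
        key = tuple(sorted((localization[a], localization[b])))
        counts[key] += 1
    return counts


# Hypothetical toy data.
toy_pairs = [("P1", "P2"), ("P2", "P3"), ("P3", "P3"), ("P1", "P4"), ("P3", "P5")]
toy_localization = {"P1": "nucleus", "P2": "nucleus", "P3": "Golgi", "P5": "vacuole"}
pair_counts = count_compartment_pairs(toy_pairs, toy_localization)
```

The resulting counts form the observed side of a table like the one in Figure 4; the expected side would come from repeating the same tally on randomized networks.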

Coexpression of Interologs
Proteins that interact could be expected to possess similar or complementary gene expression profiles (for example, see Ge et al., 2001).
Consequently, if an interolog pair is positively coexpressed, this strengthens the confidence in the prediction that the pair interacts. The lack of correlation does not necessarily imply that the interologs do not interact. Specifically, it could be the case that one member is constitutively expressed while the other interacting partner is only expressed under certain conditions. It is also possible that an existing protein might relocalize to a new compartment, undergo allosteric regulation, or even move extracellularly or translocate to other tissues, thus propagating a signal or interaction without a change in mRNA level. Coexpression of genes was computed by applying the Pearson correlation coefficient (r) to expression data for a gene pair. The gene expression analysis exhibited a strong and statistically significant trend (P < 10⁻²¹, using a two-sample Kolmogorov-Smirnov test; see "Materials and Methods") of coexpression for the interolog pairs when compared to random gene pairs drawn from all Arabidopsis coding sequences (Fig. 5A). As well, we were able to visualize a positive correlation between interolog pair coexpression and the interolog CV (Fig. 5B). We have incorporated known interactions and our predicted interolog data into the Arabidopsis Interaction Viewer at http://bbc.botany.utoronto.ca/interactions/ and into the outputs from the Expression Browser and Expression Angler tools of the Botany Array Resource (Toufighi et al.; see Fig. 6). It is thus possible to easily see if two or more genes that are coexpressed are interologs or interactors. Additionally, genes with unknown function that are flagged as interologs and coexpress with known genes are very likely to be involved in the biological process of the gene of known function, thus aiding hypothesis generation.
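For readers who want to reproduce the two statistics named above on their own data, the following standard-library sketch computes a Pearson correlation coefficient for a pair of expression profiles and the two-sample Kolmogorov-Smirnov statistic (the largest gap between two empirical distributions). The function names and toy profiles are hypothetical, and the sketch stops at the statistic: converting it to a P value would need a statistics package.

```python
from bisect import bisect_right
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have equal length of at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: maximum distance between empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    # The maximum gap between two step functions occurs at an observed value.
    return max(
        abs(bisect_right(a, v) / len(a) - bisect_right(b, v) / len(b))
        for v in a + b
    )


# Hypothetical toy expression profiles across four samples.
gene_1 = [1.0, 2.0, 3.0, 4.0]
gene_2 = [2.0, 4.0, 6.0, 8.0]
gene_3 = [8.0, 6.0, 4.0, 2.0]
r_positive = pearson_r(gene_1, gene_2)
r_negative = pearson_r(gene_1, gene_3)
```

In use, `pearson_r` would be applied to every interolog pair and to every random pair, and `ks_statistic` would then compare the two resulting lists of r values.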

How to Use the Predicted Interactome
A predicted interactome has been made for Arabidopsis, based on evolutionary conservation of protein interactions across species. Each interaction has been assigned a CV based on the number of organisms and experiments that support it. Assignments for subcellular localization and coexpression can be used as further indicators of confidence in a predicted interaction. Interacting proteins tend to be colocalized to the same compartment, or to adjacent compartments such as Golgi/vacuole and Golgi/ER. As many of these proteins are part of endomembrane trafficking complexes such as SNARE-SNAP-syntaxin (Fig. 3), these intercompartment interactions could represent docking or trafficking interactions. Interacting proteins also tend to be highly coexpressed across tissues and organs and in response to hormone and stress treatments.
To make use of this resource, an interactome network file is built using a network assembly and visualization tool. Alternatively, users may query the Arabidopsis Interaction Viewer at http://bbc.botany.utoronto.ca/interactions/. The raw data to build the network have been provided (Supplemental Table S1). Both Osprey 1.2 (http://biodata.mshri.on.ca; Breitkreutz et al., 2003) and Cytoscape 2.4.1 (http://www.cytoscape.org; Shannon et al., 2003) are publicly available tools well suited to browsing the Arabidopsis predicted interactome. Network files have been prebuilt for both of these tools and have been included as Supplemental Data (api.osp and api.cys), and are also available upon request. These can simply be loaded into the appropriate tool and one can begin browsing or searching the interactome right away. Cytoscape, used in this work, offers more visualization options, although Osprey is a little easier to navigate for the uninitiated. In addition to the Cytoscape network file, other files containing layout, node, and edge attributes are available. In our visualization, nodes are given both The Arabidopsis Information Resource (TAIR) annotation (AtXgNNNNN) and common gene names as interchangeable attributes; the size of the node is related to the number of interactions, while the color of the node is its SUBA assignment for subcellular localization.
Figure 3. SNARE-syntaxin network expanded by predicted interactions. Proteins with known, experimentally determined interactions (blue lines) from the BIND dataset formed an initial set. This was expanded one layer outwards by identifying all proteins that are predicted to interact with proteins from the initial set. All predicted interactions are rated by CV (line thickness) and coexpression (line color). Nodes are color coded with predicted subcellular localizations and sized according to the number of predicted interacting protein partners throughout the entire predicted interactome. Note that the interaction between OSM1 and VTI12 is both predicted and experimentally determined (both red and blue lines connect these nodes).
Figure 4. Subcellular localization of protein interactions. A, A network subset of medium confidence interacting proteins where proteins were assigned to a subcellular compartment in the SUBA database. B, Analysis of all interacting protein pairs in which both partners were assigned to a subcellular compartment. The number of individual proteins is in italics beside each compartment name. Compartment pairs that showed enriched or depleted numbers of interactions (compared to chance) are color coded. For example, there is a significant (P < 0.01) enrichment of interactions in which both partners are nuclear localized, while there is a significant depletion of interactions between nuclear and vacuolar localized proteins. Chloro, Chloroplast; Cyskel, cytoskeleton; Excell, extracellular; Mito, mitochondria; Perox, peroxisome.
Figure 5. Coexpression of interologs. A, The PCC for 19,979 predicted interaction pairs was calculated and plotted as the number of pairs in each Pearson correlation coefficient range, with an r unit bin size of 0.1 (blue points). The correlation coefficient calculation was also performed for 20,000 randomly selected pairs of Arabidopsis genes from within our interactome (green points), from all AGI IDs on the ATH1 GeneChip (red points), or from all AGI IDs on the ATH1 GeneChip such that the topology of the random network was the same as that of our predicted interactome (magenta points). Note that not all gene pairs mapped to probe sets on the Affymetrix ATH1 GeneChip. The gene expression set used is a compendium of the four smaller AtGenExpress compendia displayed in the Expression Browser tool at http://bbc.botany.utoronto.ca. These include data sets generated by Schmid et al. (2005), Kilian et al. (2007), and other members of the AtGenExpress consortium. Genes with a high PCC are considered to be coexpressed. The interolog distribution is shown to contain many coexpressed pairs. B, The interolog CV was plotted against the correlation coefficient for each pair, demonstrating that a high confidence score (score ≥ 11) may suggest that the interolog pair is [...]
[...] Table S1 and including a column of whatever attribute is to be assigned to each interaction or each protein. While the interactome in its entirety initially appears as a giant hairy ball, it is possible to zoom into each section, or more profitably to use filters to look at specific genes and all surrounding interactions (e.g. by using the depth filter in Osprey or the filter dialog in Cytoscape). This gives researchers the ability to build and extend their own pathway or protein complex using this Arabidopsis predicted interactome. For custom queries using the Arabidopsis Interactions Viewer, the user may also download a file for use in Cytoscape, or explore the network within an SVG-plugin-enabled Web browser.

Universality and Ancestry of Some Protein Interaction Pathways
Predicting the Arabidopsis interactome relies on some universality of protein function among eukaryotes, especially as these predictions are based on interactions of nonplant species. Using the INPARANOID ortholog prediction algorithm, we identified 3,206 genes for which orthologs were found in all five eukaryotic species (Arabidopsis, yeast, nematode worm, fruitfly, and humans), which is a significant fraction (approximately 10%) of the genome, and another 7,570 Arabidopsis genes had an ortholog in at least one other species. We also identified 292 interactions that were present in at least two different species and orthologous genes found in Arabidopsis (Supplemental Table S1), which is large considering the incomplete nature of interactome datasets and small pool of overlapping orthologs. We have estimated that conservatively there are 100,000 to 200,000 interactions if we assume that the number of noninteracting proteins is proportionately similar to yeast, and that conserved interologs discovered here are significantly more interactive than nonconserved genes.
Interestingly, the most highly conserved interactions tend to be those between two highly connected hubs. Is there an increased likelihood that the duplication or loss of that hub is lethal or deleterious when one protein interacts with many others compared to loss or duplication of a single protein interaction? If so, highly connected hubs may thus be under more evolutionary pressure to remain conserved, while smaller hubs are free to duplicate and diverge. Hubs with 50 or more interacting partners tended to be enriched in nuclear localization, DNA, and RNA metabolism, although not transcription factors themselves, and 11% of all large hubs are lethal genes or otherwise indispensable (Tzafrir et al., 2004), while this applies to only 6% of intermediate hubs and 5% of pipes and loose ends (Supplemental Fig. S5). Core members of a protein complex are highly coexpressed and often lethal if mutated (Dezso et al., 2003). These core proteins are surrounded by a cloud of transiently docking peripheral proteins that are less likely to be lethal if individually deleted. Highly connected hubs may thus represent these conserved cores of signaling complexes.
Another surprise was the chloroplast localization of many of the interologs. These are all nuclear encoded, but chloroplast localized proteins. While photosynthesis is absent from the species used to build the orthologs, the complex phylogenetic origin of proteins found in the chloroplasts of higher plants (Leister, 2003) means that many proteins found in the chloroplast today have close orthologs in species from other evolutionary lineages. Not only were orthologs to chloroplast genes found, but they were enriched (P < 0.1; Fig. 4) for interactions with orthologs to genes that were also chloroplast localized in Arabidopsis. This suggests that interacting pathways have been moved to the chloroplast from other cellular compartments at some time postendosymbiosis. Closer inspection of this list of 55 chloroplast located interolog pairs reveals it includes components of a number of well characterized metabolic pathways that are known from the literature to have chloroplast located versions in plants, but to be mitochondrial or cytosolic in yeast and animals (Supplemental Table S1). This includes enzymes of biosynthetic pathways for purines, pyrimidines, heme, and riboflavin, but also enzymes in or associated with glycolysis and a range of proteins involved in posttranscriptional and translational machinery. These proteins have not simply been cannibalized to make new plant-specific pathways, but the original function of these pathways and complexes is likely to be preserved, and thus the predicted interactions of chloroplast protein orthologs are probably still functionally related.
Figure 6. (Continued.) expression clustering results, indicating high degrees of coexpression, while the loops joining two AGI identifiers highlighted by the right arrow denote interolog pairs. The color of the loop indicates the interolog CV. The AGI identifiers are colored according to their biological functions: light green, transcription initiation; dark green, DNA mismatch repair; light blue, pyruvate dehydrogenase E1a and E1b subunits; dark blue, proteasomal complex components; magenta, spliceosomal components; orange, DNA replication; white, unknown. B, Clicking on the interolog loops in the above output will open an output window for an Arabidopsis Interaction Viewer query, providing more detailed information on the predicted and experimentally identified interactions present in the database.

Interolog Construction
Ortholog data for generating interologs were obtained from INPARANOID (http://inparanoid.cgb.ki.se/) and ENSEMBL (http://www.ensembl.org/index.html) through BIOMART (http://www.biomart.org/), and loaded into a MySQL database. To obtain as many functional orthologs as possible in the dataset, especially for large gene families, we chose to include only individual ortholog pairs from each family. This reduced the number of potential interactions, many of which are probably real, but many more, especially those involving divergent in-paralogs, are likely false positives. More recent methods for predicting functional orthologs using evolutionary conservation of partners (Bandyopadhyay et al., 2006) or coevolution were not used due to the high computational demands of applying these methods to entire genomes. Interactome databases were obtained from BIND (08-11-2005 release), MIPS (November 2005 release), BIOGRID (version 20), and DIP (November 2005 release). These interactome datasets can be found at http://www.unleashedinformatics.com, http://mips.gsf.de, http://www.thebiogrid.org, and http://dip.doe-mbi.ucla.edu, respectively. Interactome and ortholog data included many different types of gene identifiers, so cross-identification tables were constructed from BIOMART, TAIR (www.arabidopsis.org), and data kindly donated by Tanya Berardini; these tables are available upon request. Orthologs were mapped onto interactome data, and where both interacting proteins in a reference species had orthologs in Arabidopsis (Arabidopsis thaliana), an interolog prediction was recorded. The raw table of interologs is found in Supplemental Table S1 as an Excel spreadsheet. This includes the Arabidopsis proteins, the reference species, and the reference interactome for 37,235 predicted interactions, and has many duplicates where the same interaction was predicted from different species or interactome datasets.
A separate sheet was generated containing 19,979 unique interacting protein pairs (entered in both orientations), along with the CV, the Pearson correlation coefficient of coexpression (PCC), and the predicted subcellular localization. A third sheet was added that includes the identification of each Arabidopsis protein in the dataset and the number of unique interacting partners that protein is predicted to have. The average number of interactions per interacting protein we predict by interologs in Arabidopsis is 10.9, while in the yeast (Saccharomyces cerevisiae) interactome the average number of interactions per interacting protein is 22.3, and in Drosophila 9.02. Due to the incomplete nature of interolog mapping, we expect some bias toward intermediate and large hub detection. For example, suppose interologs pick up 10% to 50% of all interactions among conserved proteins, and a large hub has 100 interactions in humans and 100 interactions in Arabidopsis. Through mapping interologs we might then detect 10 to 50 of those interactions and declare this Arabidopsis protein to be an intermediate or major hub. With the same detection rate, a human protein with just one interaction has a 10% to 50% chance of appearing in Arabidopsis and a 50% to 90% chance of not being included; thus proteins with only one or two interacting partners will likely be undercounted by the interolog method.
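The figures in this paragraph can be checked with a short worked calculation. This is one plausible way to arrive at an average near 10.9, assuming each heterodimer pair contributes a partner to both proteins and each self interaction contributes one; whether the average was originally computed this way is not stated. The variable names are illustrative.

```python
# Counts reported for the predicted interactome.
heterodimer_pairs = 19_368
self_interactions = 611
proteins = 3_617

# Assumption: a heterodimer adds one partner to each of two proteins,
# while a self interaction adds a single partner to one protein.
partner_slots = 2 * heterodimer_pairs + self_interactions
average_partners = partner_slots / proteins

# Detection bias example: a hub with 100 true interactions, of which
# interolog mapping recovers between 10% and 50%.
true_interactions = 100
detected_low = round(true_interactions * 0.10)
detected_high = round(true_interactions * 0.50)
```

Under these assumptions the average comes to about 10.9 partners per protein, and the 100-partner hub would still appear with 10 to 50 partners, whereas a protein with a single partner would often be missed entirely.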

Calculation of the CV for Experimental Support
To estimate the strength of experimental support for each predicted interaction, we calculated a CV. An interaction is more convincing if different experimental methods predict it, and it is likely more conserved if it appears in multiple species. Our CV began with the total number of datasets the interaction appears in (N); it was given a bonus multiplier if different experimental methods predicted the same interaction (E) and another multiplier if the interaction was found in multiple eukaryote species (S). The formulation $\mathrm{CV} = N \times E \times S$ was thus our best attempt at determining the level of experimental support. The distribution of interactions by confidence was calculated for the CV and each component variable (presented in Supplemental Table S1; Fig. S4). Interactions can thus be divided by confidence. The low confidence set (CV = 1) will contain some false positives, especially as some high-throughput techniques such as yeast two hybrid have likely generated many artifactual interactions, particularly in early yeast interactomes (Cornell et al., 2004). False-positive interacting pairs are unlikely to be repeated, discovered using different experimental techniques, or found in other species, and so are generally limited to this low confidence dataset. Some experimental techniques, such as phenotypic enhancement or suppression, offer only indirect evidence for a physical interaction between proteins, and are as likely to represent genetic interactions. A total of 3,967 interactions are based on these indirect kinds of evidence, while 16,012 interactions have direct evidence for a physical interaction (e.g. affinity capture, yeast two hybrid, etc.). A total of 1,268 interactions have both direct and indirect evidence.
Experimental evidence type is included in Supplemental Table S1, on the sources of interactions worksheet, and each predicted interaction is flagged with direct/indirect/both on the Arabidopsis interactome worksheet to allow researchers to exclude or include each subgroup.
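As a rough illustration of how a CV of this form could be computed, the sketch below takes N as the number of distinct datasets and, as an assumption not stated in the text, takes the multipliers E and S to be the number of distinct experimental methods and the number of distinct species, respectively. The dataset labels and the function name `confidence_value` are hypothetical.

```python
def confidence_value(evidence):
    """Compute CV = N * E * S for one predicted interaction.

    `evidence` is a list of (dataset, method, species) tuples.
    ASSUMPTION: E and S are taken to be simple counts of distinct
    methods and species; the text only describes them as bonus
    multipliers, so the actual values used may differ.
    """
    n = len({dataset for dataset, _, _ in evidence})  # N: datasets
    e = len({method for _, method, _ in evidence})    # E: methods (assumed)
    s = len({species for _, _, species in evidence})  # S: species (assumed)
    return n * e * s


# Hypothetical evidence records for two predicted interactions.
single = [("dataset_A", "yeast two hybrid", "yeast")]
multi = [
    ("dataset_A", "yeast two hybrid", "yeast"),
    ("dataset_B", "affinity capture", "yeast"),
    ("dataset_C", "yeast two hybrid", "human"),
]

print(confidence_value(single))  # one dataset, one method, one species
print(confidence_value(multi))   # three datasets, two methods, two species
```

With these assumed multipliers, an interaction supported by a single dataset receives CV = 1, which matches the low confidence class described above.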

Comparison of Interologs to Experimentally Derived Arabidopsis Interactions
A gold standard of experimentally generated interactions was established from 711 interactions of 356 Arabidopsis proteins mined from the literature by BIND. Of these proteins, 95 were found in the predicted interactome. Among this common set of 95 proteins, 85 interactions were identified experimentally (BIND) and 70 were predicted by our method. There were 30 interactions common to both sets, whereas only 1.30 interactions would be expected to match by chance, given the total possible number of 4,560 interactions between 95 proteins (= 95 × 94/2 for unique heterointeractions + 95 self interactions) and extracting random subsets of 85 and 70 interactions (85/4,560 × 70 = 1.30).
Using a χ² test with 3 degrees of freedom, we compared the observed numbers of interactions (30 overlap, 55 BIND only, 40 our method only, and 4,435 noninteractors) with the expected distribution (1.3 overlap, 83.7 BIND only, 68.7 our method only, 4,406.3 noninteractors). This gave a P value of $10^{-142}$, meaning that the observed distribution is very unlikely to occur by chance.
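The counts in this comparison can be rechecked with a short calculation. The following is a minimal sketch using only the Python standard library; it uses the rounded expected values quoted in the text and the closed-form upper tail of a χ² distribution with 3 degrees of freedom, so the resulting P value should be read as an order-of-magnitude check rather than a reproduction of the reported figure. The function names are illustrative.

```python
import math


def chi_square_stat(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi2_sf_3df(x):
    """Upper-tail probability of a chi-square variable with 3 degrees
    of freedom, from its closed-form distribution function."""
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))


# Possible pairs among 95 proteins: heterointeractions plus self interactions.
total_pairs = 95 * 94 // 2 + 95
# Overlap expected by chance between random subsets of 85 and 70 pairs.
expected_overlap = 85 / total_pairs * 70

# Categories: overlap, BIND only, our method only, noninteractors.
observed = [30, 55, 40, 4435]
expected = [1.3, 83.7, 68.7, 4406.3]  # rounded values from the text

stat = chi_square_stat(observed, expected)
p_value = chi2_sf_3df(stat)
print(total_pairs, round(expected_overlap, 2))
print(f"chi-square = {stat:.1f}, P = {p_value:.1e}")
```

Nearly all of the statistic comes from the overlap category, where 30 shared interactions are observed against roughly 1.3 expected.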

Subcellular Localization
Protein localization data were taken from SUBA (http://www.suba.bcs.uwa.edu.au; Heazlewood et al., 2005). SUBA contains direct or indirect experimental data on the localization of 6,743 Arabidopsis proteins from five sources of information: GFP fusion experiments, mass spectrometry studies, AmiGO annotation, Swiss-Prot annotation, and localization based on TAIR gene descriptions. From these data, proteins are localized to the following 12 distinct cellular compartments: cell plate, chloroplast, cytoskeleton, cytosol, ER, extracellular space, Golgi, mitochondria, nucleus, peroxisome, plasma membrane, and vacuole.
Occasionally, two or more of the information sources in SUBA annotate a protein as located in different compartments. In this study, a winner-takes-all approach was adopted in which a protein was assigned to a compartment if a plurality of the five information sources above annotated it as belonging to that compartment. For example, if both the GFP and AmiGO fields in SUBA contain a localization of mitochondria but the Swiss-Prot field contains cytosol, the protein is designated as mitochondrial. Proteins for which no subcellular compartment was designated by a plurality of information sources were ignored. Note that for a large majority of proteins, SUBA either contained data from only one source or contained multiple sources that agreed on localization. The winner-takes-all approach was used to resolve conflicting subcellular localizations for 358 proteins. The processing of the SUBA localization data assigned a single subcellular localization to 5,832 Arabidopsis proteins.
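The winner-takes-all rule can be sketched as a simple plurality vote. The source field names used below (`gfp`, `ms`, `amigo`, `swissprot`, `tair`) and the function name `assign_compartment` are hypothetical labels for the five SUBA information sources, not actual SUBA column names.

```python
from collections import Counter


def assign_compartment(annotations):
    """Return the compartment named by a plurality of sources, or None.

    `annotations` maps a source name to a compartment (or None when the
    source has no data). A tie for first place means no plurality, so the
    protein is left unassigned.
    """
    counts = Counter(c for c in annotations.values() if c)
    if not counts:
        return None
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # no single compartment wins the vote
    return ranked[0][0]


# The example from the text: GFP and AmiGO agree, Swiss-Prot differs.
example = {"gfp": "mitochondria", "ms": None, "amigo": "mitochondria",
           "swissprot": "cytosol", "tair": None}
tied = {"gfp": "nucleus", "ms": None, "amigo": None,
        "swissprot": "cytosol", "tair": None}

print(assign_compartment(example))  # plurality for mitochondria
print(assign_compartment(tied))     # one vote each, so unassigned
```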

Enrichment Analysis
Statistical tests for the observed numbers of interologs with respect to the subcellular localization of interacting proteins were computed following the methods of Gandhi et al. (2006). The P value for the observed number of interologs $n_{ab}$, where one protein is in subcellular location a and the other in b, is calculated using a Poisson distribution:

$$
P(n_{ab}) =
\begin{cases}
\displaystyle\sum_{j=0}^{n_{ab}} \frac{\bar{n}_{ab}^{\,j}\,\exp(-\bar{n}_{ab})}{j!}, & n_{ab} < \bar{n}_{ab} \quad \text{(depletion)} \\[2ex]
\displaystyle\sum_{j=n_{ab}}^{\infty} \frac{\bar{n}_{ab}^{\,j}\,\exp(-\bar{n}_{ab})}{j!}, & n_{ab} \ge \bar{n}_{ab} \quad \text{(enrichment)}
\end{cases}
$$

Here $\bar{n}_{ab}$ is the expected number of interologs with one protein in location a and the other in location b for the ensemble of random protein networks that maintain the same properties as the observed network: the annotation of proteins in their subcellular compartments, the degree (k) of each protein (the number of proteins that interact with it), and the total number of interacting pairs (E). $\bar{n}_{ab}$ is given by a sum (Supplemental Materials and Methods S1) in which the indices i and j run over all interacting proteins and the indicators $c_{ia}$ equal 1 or 0 if protein i is in compartment a or not, respectively. The term OR indicates that if both proteins are in the same compartment the term within the parentheses is 1. Self-interacting proteins in the Arabidopsis interactome were ignored in this analysis, to avoid spurious enrichment.
The P values are finally subject to a multiple-testing correction, $P_{\text{multi}} = 1 - (1 - P)^{m}$, where for enrichment m equals the number of ab pairs with at least one observed interolog and for depletion m equals the number of ab pairs possible in the ensemble of random networks. The reader is referred to Supplemental Materials and Methods S1 or Gandhi et al. (2006) for more details on the analysis.
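A minimal sketch of the Poisson tail probabilities and of the correction 1 − (1 − P)^m follows. It assumes the expected count for a compartment pair has already been computed from the random-network ensemble and is supplied as a positive number; the function names and example values are illustrative only.

```python
import math


def poisson_pmf(j, mean):
    """Poisson probability of observing exactly j events (mean > 0)."""
    return math.exp(j * math.log(mean) - mean - math.lgamma(j + 1))


def localization_p_value(n_obs, n_exp):
    """Return ('depletion' | 'enrichment', P) for an observed count n_obs
    given an expected count n_exp > 0 (assumed to be precomputed)."""
    if n_obs < n_exp:
        # Lower tail: probability of n_obs or fewer interologs.
        return "depletion", sum(poisson_pmf(j, n_exp)
                                for j in range(n_obs + 1))
    # Upper tail: probability of n_obs or more interologs, summed
    # directly until further terms no longer change the total.
    total, j = 0.0, n_obs
    while True:
        term = poisson_pmf(j, n_exp)
        total += term
        if j > n_exp and term <= 1e-16 * total:
            break
        j += 1
    return "enrichment", total


def multiple_testing_correction(p, m):
    """P_multi = 1 - (1 - P)^m, evaluated stably for very small P."""
    return -math.expm1(m * math.log1p(-p))


kind, p = localization_p_value(5, 1.0)  # 5 observed, 1 expected
print(kind, p, multiple_testing_correction(p, 20))
```

Summing the upper tail term by term, rather than subtracting the lower tail from 1, avoids losing precision when the enrichment P value is very small.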

Coexpression Analysis
We examined Arabidopsis microarray data on the Affymetrix ATH1 chip from 945 AtGenExpress data sets (Schmid et al., 2005). Coexpression between pairs was determined using the Pearson correlation coefficient (r):

$$
r = \frac{\sum_{i=1}^{N} X_i Y_i - \frac{1}{N}\sum_{i=1}^{N} X_i \sum_{i=1}^{N} Y_i}{\sqrt{\left(\sum_{i=1}^{N} X_i^{2} - \frac{1}{N}\left(\sum_{i=1}^{N} X_i\right)^{2}\right)\left(\sum_{i=1}^{N} Y_i^{2} - \frac{1}{N}\left(\sum_{i=1}^{N} Y_i\right)^{2}\right)}}
$$

where N = number of expression samples, $X_i$ = expression level for gene X in the ith sample, $Y_i$ = expression level for gene Y in the ith sample, and −1 ≤ r ≤ 1. High positive r values indicate a correlation of expression patterns, while strongly negative r values indicate anticorrelation. To generate an accurate representation of gene expression, we combined four AtGenExpress compendia (hormone, pathogen, stress, and tissue; see Supplemental Data for NASCArrays sample identifiers for the data sets in each compendium) into one large multi-data-set compendium containing 945 data sets in total. The Arabidopsis Genome Initiative (AGI) number to ATH1 probe set lookup was performed with a table from TAIR called affy25k_array_elements-2006-01-06.txt. r values were binned into 0.1 r unit bins to generate a distribution. To determine whether the interolog coexpression distribution was enriched in pairs that exhibited high correlation coefficients, we performed a two-sample Kolmogorov-Smirnov test on the interolog and random distributions. Random distributions were generated by randomly drawing 20,000 Arabidopsis protein pairs in one of three ways: from within the collection of interacting proteins predicted in this article; from any of the AGI IDs on the ATH1 GeneChip; or from any AGI ID on the ATH1 GeneChip such that the topology of the random set matched that of our predicted interactome in terms of hubbiness. For all of these random sets, r values were computed and subsequently binned into 0.1 r unit bins. These statistics were generated using the stats package of the R programming language, which includes a built-in function ks.test (the R Project for Statistical Computing: The R Reference Index-Kolmogorov-Smirnov Tests; http://www.r-project.org/). This function computed a P value as illustrated in Marsaglia et al. (2003).
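The two core computations in this analysis, the Pearson correlation of two expression profiles and the two-sample Kolmogorov-Smirnov statistic, can be sketched with the Python standard library. This is an illustration only: the analysis itself used R's `ks.test`, which also returns a P value, whereas the sketch stops at the statistic D and works on unbinned values. The expression vectors and r values shown are made up.

```python
import math
from bisect import bisect_right


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    numerator = sxy - sx * sy / n
    denominator = math.sqrt((sxx - sx * sx / n) * (syy - sy * sy / n))
    return numerator / denominator


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest absolute difference between
    the two empirical cumulative distribution functions."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for t in a + b:  # the maximum is attained at an observed value
        fa = bisect_right(a, t) / len(a)
        fb = bisect_right(b, t) / len(b)
        d = max(d, abs(fa - fb))
    return d


# Made-up expression profiles for two genes across five samples.
gene_x = [2.0, 4.1, 6.2, 7.9, 10.1]
gene_y = [1.1, 2.0, 2.9, 4.2, 5.0]
print(round(pearson_r(gene_x, gene_y), 3))

# Made-up r values for predicted pairs versus random pairs.
interolog_r = [0.62, 0.48, 0.71, 0.15, 0.55]
random_r = [0.05, -0.12, 0.21, 0.02, -0.30]
print(ks_statistic(interolog_r, random_r))
```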
In an attempt to find a gold standard against which to compare the coexpression of interolog pairs, we looked at the coexpression between pairs of confirmed Arabidopsis PPIs from the BIND database. However, these findings were not conclusive because the number of confirmed PPIs is small compared with the interolog dataset. In addition, it appears that not all confirmed PPIs in Arabidopsis are strongly coexpressed; this comparison was therefore omitted. Again, it is not an absolute requirement that interacting proteins exhibit coexpression, as one member may be constitutively expressed while another is induced under a specific condition. Indeed, when we examined the r values for interolog pairs in the individual AtGenExpress compendia (e.g. the Schmid et al., 2005 developmental map compendium versus the Kilian et al., 2007 abiotic stress compendium), we noticed that the pairs are quite often much less correlated in their expression patterns in one compendium than in another.

Supplemental Data
The following materials are available in the online version of this article.
Supplemental Figure S1. RNA splicing network expanded by predicted interactions.
Supplemental Figure S2. RHO-RAB network expanded by predicted interactions.
Supplemental Figure S3. Homeobox network expanded by predicted interactions.
Supplemental Figure S4. Distribution and construction of the CV.
Supplemental Figure S5. Analysis of hub size.
Supplemental Table S2. Sources of microarray expression data.
Supplemental Table S3. Resolution of conflicting localizations in SUBA.
Supplemental Materials and Methods S1. Enrichment analysis.
Supplemental File S1. Interactome network file in Cytoscape format.
Supplemental File S2. Interactome network file in Osprey format.