Efficient exploration of pan-cancer networks by generalized covariance selection and interactive web content

Statistical network modeling techniques are increasingly important tools to analyze cancer genomics data. However, current tools and resources are not designed to work across multiple diagnoses and technical platforms, thus limiting their applicability to comprehensive pan-cancer datasets such as The Cancer Genome Atlas (TCGA). To address this, we describe a new data driven modeling method, based on generalized Sparse Inverse Covariance Selection (SICS). The method integrates genetic, epigenetic and transcriptional data from multiple cancers, to define links that are present in multiple cancers, a subset of cancers, or a single cancer. It is shown to be statistically robust and effective at detecting direct pathway links in data from TCGA. To facilitate interpretation of the results, we introduce a publicly accessible tool (cancerlandscapes.org), in which the derived networks are explored as interactive web content, linked to several pathway and pharmacological databases. To evaluate the performance of the method, we constructed a model for eight TCGA cancers, using data from 3900 patients. The model rediscovered known mechanisms and contained interesting predictions. Possible applications include prediction of regulatory relationships, comparison of network modules across multiple forms of cancer and identification of drug targets.

where θ c ij is the ij element of Θ c , and S c is the sample correlation matrix of cancer class c. The first penalty term controls the overall sparsity level of all precision matrices (1): the larger the tuning parameters λ c 1 are, the sparser the Θ c are. The elastic net parameter α improves the estimation stability in the presence of highly correlated variables (2). When α = 1 the penalty corresponds to a pure lasso penalty, and when α = 0 the penalty corresponds to ridge regression penalty. Common choices are 0.90 and 0.95 (3; 4) Here, α is set to a value of 0.95, which produces a sparse network. Sensitivity analysis (using λ 2 = 0 and λ 1 set to produce networks of 2800-3000 links) showed that changing α in the interval 0.9 to 1.0 changes network structure by less than 5 % ( Figure S1). The second penalty term is the so-called fused penalty (5; 6), where the parameter λ cc 2 controls the degree of differential connectivity, i.e. the tendency of links to be shared across cancers: the larger it is, the more the link value is constrained to a common value across the pair of cancer classes c and c . The sparsity constraints are further augmented by prior factors, ν ij , and adaptive modular factors, ω cc ij (see below).
In practice, integration of data types will include both continuous (e.g. mRNA) and binary (e.g. mutation) data. The partial correlations can be obtained from the precision matrix, for any distribution (7), thus construction of networks from partial correlations is well defined for all types of data, i.e. not only continuous Gaussian distributed measurements.
Banerjee et al (8) showed that sparse maximum likelihood estimation of partial correlations for binary random variables can be well approximated by the Gaussian model. Recent work has focused on semiparametric modeling via copula graphical models, e.g. (9) which extends graphical models to a mix of variable types. The idea behind copula graphical models is to estimate an optimal marginal transformation for each variable such that the transformed variables can be modeled with a multivariate Gaussian likelihood. In (9) the authors show that robust and efficient estimates of the partial correlation is possible without directly estimating the marginal transformations. Instead, a nonparametric estimate of variable-association, e.g. rank correlation, is used as input into the Gaussian graphical model. The strong results, both theoretical and simulation-based, of (9) motivate us to use the SICS framework to model TCGA data. For variables that exhibit non-Gaussian characteristics, rank-correlation thus serves as a proxy for marginal variable transformations. : Sensitivity analysis with respect to parameter α. We changed α in the interval 0.9 to 1.0. Compared to the default value of 0.95, the network structure changes by less than 5%, measured by 1-Jaccard index.

Sample size correction to reduce the impact of unbalanced data sets across different cancers
If identical penalty parameters, λ 1 and λ 2 , are used for all cancer classes, it is a simple exercise to show that the effective penalties are: (a) λ 1 /n c , for the sparsity constraint, where n c is the number of cases for each cancer, resulting in more sparse networks for cancer classes with smaller sample sizes; and (b) λ 2 nc+n c 2ncn c , for the differential connectivity constraint between cancer classes c and c , resulting in a more aggressive penalty on differential connectivity between small cancer classes. This dependence on sample size is undesired, since it is biologically reasonable to assume that different cancer classes have similar network sizes. Also, if data sets with extremely small sample sizes are included, their estimates would be close to empty and/or equal, rendering them uninformative. To control this behavior we define new effective To test our sample size correction proposal, we set up a simulation study as follows. We simulate a data set with C = 3 classes and p = 250 variables each. We subsequently construct networks for different values of δ and λ 1 ( Figure S2a), and compare the estimated networks to the true network and compute the true positive rate (TPR, proportion of true links found in the estimated network) constraining the false positive rate (FPR, proportion of false links found in the estimated network) to some small value. The maximum TPR is achieved for δ 0.4 across a wide range of FPR constraints ( Figure S2b). This illustrates that sample size correction is motivated since otherwise δ = 1 would have yielded a higher TPR. It also illustrates that no correction (corresponding to δ = 1), or naive sample size correction (6), (corresponding to δ = 0), is not optimal. For the TCGA data, we choose δ as the value that minimizes the maximum pairwise difference of network sizes for the different cancers. That is, we assume that an appropriate sample size correction should result in network sizes of similar size across all cancers. The maximum pairwise difference criterion is supported by results from the controlled simulation setting showing that the maximum pairwise difference measure is optimized near the optimum for TPR. By this criterion, for TCGA data δ should be in the range 0.05-0.1 ( Figure S2c), and for our analyses we chose to use δ = 0.08. Lower values of δ will make augmented SICS detect more links for cancers with a high number of patients, and higher values of δ will make the method detect more links for cancers with a low number of patients.  Figure S2: Choice of the sample size correction parameter δ. See text for details. (a) simulated data for three cancers: changing the optimization parameters δ and λ 1 find a point where the maximum pairwise network difference is small (δ ≈ 0.4). (b) selecting δ to produce a similar network size for the different cancers maximizes the true positive rate (TPR) of detected link. (c) curves used to select δ for the TCGA data. The global objective function includes a link specific prior, ν ij , which is designed to tune the sparsity penalty for forming an link between network nodes i and j. The sparsity penalty for link element (i, j) is defined as λ 1,ij = λ 1 * ν ij , where λ 1 is a common factor that controls the overall sparsity of the network, and ν ij take on three possible values: 1, u (< 1), or ∞. The motivation for this choice of prior is that it can serve to emphasize features of the model that are either more likely, based on prior information, or are of higher biological interest to the end user. In such cases ν ij is set to the value u < 1. This reduced penalty is applied in the following situations: • between miRNAs with their predicted mRNA targets, as defined by miRanda (10) prediction (Micro-Cosm Targets Version 5 (11), http://www.ebi.ac.uk/enright-srv/microcosm/htdocs/targets/v5/). This choice is motivated by the belief that such links are more likely to be real than links that do not involve defined miRNA-mRNA target relationships.
• between cis localized methylations probes with their corresponding mRNA, as defined by associations between genes and methylation probes provided in the level 3 data provided by TCGA. This choice is motivated by the belief that such cis-localized probes are likely to be involved in transcriptional suppression. Many of the detected links between promoter methylations and mRNAs do indeed have a negative sign, consistent with this expectation (Figure 2).
• between all interactions involving a point mutation. This choice is motivated by the belief that point mutations are key determinants of the molecular phenotype.
In addition, the prior is used to model the assumption that the effect of CNAs on transcription is only via cis-effects, i.e. mRNAs can only be linked to CNAs at their coding locus, which is done by setting ν ij = ∞ for all trans-interactions that involved CNA and an mRNA, and ν ij = 1 for all cis interactions. We chose the value u = 0.75 in our analyses, and found upon inspection that this prior helps give a balanced model, with involvement of the different data types.
With the proposed prior model, we primarily seek to obtain a well balanced model with connection between the data types. Using no prior at all produced results with an extensive number of links produced between CNAs in close genetic proximity and methylation probes in close genetic proximity, which we regard as a less informative network. We therefore also set ν ij = ∞ for such connections. We performed a simulation study to investigate the impact of this last restriction and found that while the network weights for other network connections were altered, the network structure itself was not much affected. Also, we investigated the enrichment of variable pairs included in the prior among strong correlations compared to weak or opposite sign correlations (Supplementary Table 1). This shows a significant enrichment for target predicted miRNA-mRNA associations for correlations < −0.2 compared to correlations ≥ −0.2. It also displays significant enrichment for cis located methylation probes and mRNA among strong negative correlations, and for cis interactions between CNA and mRNA among strong positive correlations.
To quantify the overall impact that the prior has on the network we performed a sensitivity analysis, in which we explore the effect of changing the prior strength u (defined above) from its default value of 0.75 to values in the range 0.50 (stronger prior) to 1.00 (a flat prior). When increasing u to 1.0, less than 4% of links changed for the whole network, and up to 15% of links changed for miRNA-mRNA links ( Figure S3).
When instead reducing the prior, a similar effect is seen: fewer than 10% of links changed for the whole network, and up to 35% of links changed for miRNA-mRNA links. For the case of a flat prior (no link specific information provided), we further explored if a network built using a flat prior would enrich for the link-specific prior. Such a simulation showed up to 100-fold enrichment of our prior links. The enrichment depended on the data type: for CNA-mRNA and DNA methylation-mRNA links the enrichment was up to 100-fold, and for miRNA-mRNA links, the enrichment was up to 5-fold ( Figure S4). The simulations in this test were done using λ 2 = 0 and λ 1 set to produce networks of 2800-3000 links.
In summary, while a prior formally does not require validation (because it reflects a belief), our assessment of the prior shows that it is informative and has a moderate and tunable effect on the network solution.
Changing the prior structure will likely be useful to bring forward different aspects of the data, reserved for future work. were compared to networks constructed for a range of u values between 0.50 (a stronger prior) and 1.00 (a flat prior). The average network difference (1-Jaccard index, averaged across the 8 cancers) was used to quantify the deviation from the u=0.75 case. The analysis showed that the effect of the prior is moderate, changing up to 10% of the entire network structure. miRNA-mRNA links changed less than 20% when comparing our default value to a flat prior, indicating that such links are to a high degree determined by the data rather than the prior.

Improved stability and interpretation of models via modular constraints.
The adaptive factor ω cc ij is designed to improve the stability of the network estimates and generate interpretable networks. This is done by a two-step adaptive lasso (12) method, in which preliminary network estimates (obtained using ω cc ij = 1) are used to update ω cc ij to a new value obtained from the initial network estimateΘ. The purpose of the update is to encourage all links within a module, or local sub-network, to exhibit the same link commonality or link differential connectivity properties across cancers. This is achieved by the following update: N ij denotes the set of neighbors of link (i, j), i.e. the set of links connected to nodes i and/or j.  Table S1: Enrichment of variable pairs included in prior among strong correlations. mRNA + miRNA: Enrichment of miRNA targets among correlations < t compared to correlations ≥ t. mRNA + methylation: Enrichment of interactions between cis located methylation probes and mRNA among correlations < t compared to correlations ≥ t, all methylation-mRNA associations included. mRNA + CNA: Enrichment of cis interactions between CNA and mRNA among correlations > t compared to correlations ≤ t, all CNA-mRNA associations included. Fisher's exact test was used for all enrichment calculations, i.d.= insufficient data (no observations above/below threshold).  Figure S4: Constructing networks using a flat (non-informative) prior enriches for prior links. The presented method uses link-specific priors in three different contexts: CNA-mRNA links (penalty = inf for different locus), methylation-mRNA (reduced penalty for methylation in the same (cis) promoter) and miRNA-mRNA (reduced penalty for miRNA-mRNA pairs for which there is a miRanda target prediction. We used the same fold enrichment statistic as in the main manuscript (c.f. Figure 2b). Curves are the average for 8 TCGA cancers. Similar to our analysis of PathwayCommons links, our method enriches also for priors by a factor ranging from 5 (miRNA links) to >100 (CNA-mRNA links).
adaptivity factor ω cc ij encourages fusing of link (i, j) for cancer classes c and c when (a) their link values, θ c ij andθ c ij , are close in the initial network estimate and/or (b) when this is true for neighboring links.
Applying this update to the TCGA data, we note improvement in network estimation stability, measured by Kendall's W ( Figure S5a) and enrichment of links from PathwayCommons ( Figure S5b). These metrics, previously used in (13) are described below.

Robust network construction using bootstrap.
Network estimation is a difficult problem and the results are often unstable. To produce robust network estimates we propose to use the bootstrap and compute aggregated results. Specifically, we repeat the estimation on B = 500 ( Figure S6) bootstrap data sets which consist of 90% randomly chosen tumors from each cancer class. Pseudo-code for the construction of a correlation matrix based on bootstrapped data is  Figure S5: modular constraints on network structure improves stability and functional enrichment. We derived networks from TCGA data with (curves) and without (baseline, y=0) modular constraints. The y axis represents the improvement of network robustness, measured as 1-Kendall's W (a) and pathway enrichment, defined below (b). The x axis represents the logarithm base 10 of network size.  (2), an estimate of the probability that link (i, j) differs in value between cancer classes c and c given the link is present in Links whose frequency statistics n c ij and n cc ij exceed a threshold T p and T f , respectively, constitute our (unsigned) network estimate. Specifically, the threshold T p thus controls the stability of the sparsity structure of the estimates: its value corresponds to the minimum proportion of bootstrap estimates in which an link is present. The final differential connectivity pattern is obtained using n cc ij as input into hierarchical clustering, then cutting the resulting cluster tree at height T f . The threshold T f controls the robustness of the differential connectivity pattern since its value corresponds to the maximum proportion of bootstrap estimates below which a link is fused.
Finally, we estimate the link signs. For differential links, let N + and N − be the number of bootstrap estimates in which the links were positive and negative, respectively. If N − < N + , the final estimate for the link is negative, otherwise is positive. For links whose values coincide across cancers, we similarly, define where the sum runs for the cancers in which the links are fused. If N − < N + , let the final estimate for the link be negative, otherwise positive.
The bootstrap link frequency histograms are highly informative. Reasonable values for the penalty parameters, λ 1 and λ 2 , result in bootstrap frequency histograms with a distinct U-shape (see Figure 2a in main manuscript and Figure S7a), indicating that the links comprise two populations: those frequently present (or differential) and therefore likely to correspond to "true" findings, and those frequently absent (common) in the network estimate. If λ 1 and/or λ 2 are too small or too large, the histograms shift toward the right or left, suggesting an overfit or underfit of the network models. Given a U-shaped histogram, the thresholds T p and T f should be set at values that separate the two link populations seen in the histograms. In Cancer Landscapes we allow for exploration of networks of different sizes (controlled by the penalty parameter λ 1 ).
The final networks presented result from a threshold value of T p = 80%. Similarly, the presented cancer specific differential connectivity patterns result from a threshold value T f = 60%. The results are quite robust with respect to the threshold value used (see below). These particular values were chosen for the presented networks from a simulation study, based on preliminary estimated networks, where the FDR with respect to the known simulation-model was computed.

Estimation of FDR from bootstrap data.
The FDR with respect to T p and λ 1 is defined as the proportion of false links present in the estimated network. The FDR with respect to T f and λ 2 is defined as the proportion of false differential links among links estimated to be differential. In the simulation study that was the basis for selecting a bootstrap threshold, the true and false links (in terms of presence and/or differential presence between cancers) were known and the FDR could be directly computed. The bootstrap data can also be used to estimate the FDR when the true network model is not known.
Building on the BINCO procedure (14), we estimate FDR as follows. For a bootstrap frequency histogram, we model the two populations of links using a mixture of beta-binomials (see Figure 2a, Figure S7). This allows for an link-specific inclusion probability in the network estimate. We extend the BINCO-procedure for the false links (that should not be present in the models) using a zero-inflated beta-binomial to take the high-dimensionality and extreme sparseness of TCGA large-scale network models into account (15). Once the mixture model is fit to the bootstrap link frequency histograms, we obtain a model-based estimate of the FDR as a function of the bootstrap threshold: where the numerator is the area under the estimated density above threshold T p for the false link population and the denominator is the sum of this and the area under the estimated density above threshold T p for the true link population.
Recognizing that a similar FDR estimation procedure can be applied to the differential connectivity problem, we fit a beta-binomial mixture to each of the pairwise cancer comparison frequency histograms ( Figure S7).
We can thus estimate the FDR for detecting differential connectivity for each cancer comparison as Figure S7a illustrates the procedure. The red line is the false differential link density and the blue the true differential link density. FDR for a vertical cut, T , along the x-axis correspond to the area under the red curve to the right of T divided by the sum of this and the area under the blue curve to the right of T (14).
In Figure S7b we summarize our findings. The estimated FDR is clearly quite robust and not much affected by the particular threshold used. This is due to the stability of the estimates across bootstraps. The histograms indicate that most links are persistently differential or common (extreme U-shape of the histograms). Moreover, the FDR for differential connectivity is controlled below 10% for reasonable values of the fuse penalty, λ 2 . For values of λ 2 exceeding 0.0075 FDR increases and cannot be controlled below 10%, suggesting an over-penalized model. We can clearly see that the FDR is minimized for λ 2 between 0.0025 and 0.005, indicating that these values for the tuning parameter produce the most stable differential connectivity estimates across bootstraps. These are the penalty levels used to produce the networks analyzed in the main manuscript.
Subsampling and cluster computing.
Estimation of a network on all variables (more than 1 billion pairs of variables) was practically unfeasible for further analysis. We therefore first filter the variables as follows. Only pairs of variables with correlation above the threshold 0.7, in any cancer, are considered. This screening method is valid for values of the sparsity penalty λ 1 ≥ 0.7 since variables with correlations < 0.7 will by construction not be part of the estimated networks (6). The screening thus accelerates computation time by dropping irrelevant variables without affecting network quality. To account for the prior (Section 4 above), CNA or methylation variables correlating over the threshold but only with other CNA or methylation variables, respectively, were not included. Finally, miRNA and mutation variables were included if they correlated more than a lower threshold of 0.7 · 0.75 with any other variable. After filtering, the number of variables in our network model has been reduced to p =22,447. Pseudocode describing the filter procedure is available below in section 1.3.3.
To further reduce the execution time, we adopt the blocking method of the correlation matrices by (1; 6) and implement a subsampling method for each bootstrap as follows. Blocking of the correlation matrices into subproblems was done for each cancer separately, for λ 2 = 0 and the lowest chosen setting of λ 1 . Each of the subproblems with number of variables ≥ 1000 was subject to a subsampling procedure: All variables included in the union of all V r , or present in a subproblem with number of variables less than 1000, were included in the next step. The union of selected variables over cancers was used for further analysis. We performed several simulation studies to decide on the number of subsampling steps and the   Figure S7: a: Beta binomial mixture distribution to estimate FDR of differential network connectivity.
Bars represent the fraction of bootstrap simulation in which two network links are differential between a pair of cancers. Red and blue curves: estimated probability density function (mixture beta-binomial) of false and true differential links, respectively. Bottom panel is a detailed view of the histogram and model fit, excluding the dominant 0 and 100% counts. This detailed view is provided for illustrative purposes since it clearly shows that there are differential frequency counts in the entire range of 0 to 100%. We also see that the false differential link distribution (red line) is flat toward the right side of the histogram, which explains the stability of the FDR estimate as a function of bootstrap threshold (panel b). b: FDR estimates of differential connectivity computed at two different cut-offs (50% of bootstrap runs, and 100%, respectively). Bars represent FDR values obtained for different fuse penalties (λ 2 value, groups along X axis) and different sparsity penalties (λ 1 value, blue shades). Numbers over bars indicate network size. Note that FDR is well below 5% for fuse penalties less than 0.0075.
size of the subsampling problem. With the above setting we are ensured that all variables get a chance to "compete" to be part of the final network model.
We optimize the penalized likelihood using the alternating directions method of multipliers, ADMM. For a complete description of the method see (16) and (17). To speed up the estimation, link-weight updates in the for r = 1 → 100 do S r ← correlation matrix for 100 randomly picked variableŝ Θ r ← estimated network from S r V r ← variables present inΘ r end for iterative algorithm are vectorized. In order to enable simulations for multiple penalty parameter settings, we set up a computational framework to interact with a 268 node computer cluster (C3SE) localized at Chalmers University of Technology in Gothenburg, Sweden. The implementation of our method and loading of the data from a mySQL database was done in Matlab.
Source code in Matlab is available as Supplementary files. The main script for the ADMM solver is named codeMatlab/ADMMk/ADMMk.m. Scripts, and data matrices (codeMatlab/cancerdata.tar.gz) are also available for calculation of the correlation matrix S and the prior matrix L, main scripts codeMatlab/prepare_S_Lp/computeSmatrix.m and codeMatlab/prepare_S_Lp/constructLprior.m. Additionally, the Matlab scripts used to calculate all bootstrapped networks and summarize them are given, main script CodeMatlab/cluster/Main.m. These scripts, however, are specifically designed for the C3SE cluster and requires adaption to a user specific environment.

Performance metrics.
We used two different metrics to characterize performance, previously described in (13). Figure 2b and Figure S5b, estimated networks were compared against pathway databases HPRD, NCI-NATURE, REACTOME and IntAct downloaded from Pathwaycommons.org. We map identifiers in the databases to our set of variables. We then compute the shortest path, P ij from gene pairs (i, j) in the database using Johnson's algorithm. We define the pathway enrichment of a network Θ as follows:

For analyses in
The graph Θ permuted is obtained by randomly permuting the rows in Θ (and the columns the same way), equivalent to randomly re-assigning gene names. We used a path-length k = 2 in our calculations. The numerator and denominator were estimated as follows: where N = p * (p − 1)/2 is the total number of possible links in the network. Similarly, where R = 100 is the number of random permutation graphs created.
For the analysis of estimation stability (robustness) of networks, we used 1-W, where W=Kendall's correlation coefficient, plotted as a function of network size (c.f. (13)). A low value of 1-W indicates that the method gives comparably stable network structures.

Methods analyzed
Using the two above metrics (enrichment and stability) we compared the proposed augmented SICS to the following methods: • glasso (1) -a partial correlation based method. It estimates the precision matrix (inverse correlation matrix) by optimizing its lasso penalized likelihood function. The lasso penalty introduces sparsity in the precision matrix by penalizing the absolute value of its elements. A penalty parameter λ controls the sparsity degree. • ARACNE (19) -a method that computes pairwise connectivity strengths between variables using mutual information. To introduce sparsity, the number of false positive links in this graph is reduced by considering triplets of variables for which all three pairwise strengths exceed a significance threshold (i.e. were part of the sparse graph). Using the data processing inequality (DPI), the smallest link among the triplets is removed.
• MINE (20) establishes the dependence strength of a pair of variables through the maximal information coefficient (MIC). The MIC is built on the fact that a data scatterplot can be partitioned by a grid that best encloses their relationship, if existent. It is computed as the maximum mutual information that the data can achieve given a certain grid. Sparsity is introduced by multiple testing on the estimated MIC scores.
In a comparison that focused on all mRNA data (9104 mRNA variables that passed the filter described above), we saw strong performance of the proposed SICS method in terms of pathway enrichment scores ( Figure 2, main paper). This comparison could not include MINE because of prohibitively long run times.
To include MINE, we conducted a more limited analysis as follows. On the three cancers with the largest number of tumors in the TCGA database (glioblastoma, breast cancer and ovarian cancer), we selected a random subset of 500 mRNA transcripts from the 1103 transcripts that passed the variable filtering step described above. We then randomly split the tumors into two datasets per cancer. For each dataset we estimate networks using all methods, and sweep over tuning parameter values that control the sparsity of the network. For SICS this corresponds to varying λ, for ARACNE DPI, for WGCNA β, for MINE the MIC scores, and for augmented SICS λ 1 and λ 2 . We consider each λ 2 setting as a separate method, with different levels of stringency applied to differential link connectivity. All the methods thus generate a series of networks with different sparsity levels. The entire procedure is repeated B = 10 times. For fairness of comparison, we did not use the bootstrap robust estimation for our method here.
We depict 1 − W as a function of network size in Figure S8. Thus, the network sizes that are deemed stable correspond to minimum values in the figures. Results show that augmented SICS with higher λ 2 , as well as WGCNA, are both stable methods. However, augmented SICS networks are most stable for network with size around 1000 links whereas WGCNA favor networks with 3000 links.. This result is consistent, since the correlation based network corresponding to a sparse partial correlation one is expected to be denser. ARACNE performs poorly. Augmented SICS networks with moderate λ 2 outperforms standard SICS. MINE performs similarly to WGCNA, but is computationally more heavy to apply and is not suitable for application on the full dataset.

Data preparation and processing
The proposed method can be applied to, but is not restricted to, data sets from the TCGA (the Cancer Genome Atlas, http://cancergenome.nih.gov) database. TCGA data are organized into technical platforms, and we chose the platform for each data type and cancer that maximized the number of patients in that dataset. The number of patients available for each combination of data types is displayed in  Figure S8: Stability of augmented SICS (aSICS) and a set of reference methods, measured by 1-W, where W=Kendall's correlation coefficient, plotted as a function of network size (c.f. (13)). A low value of 1-W indicates that the method gives comparably stable network structures.
were downloaded as TCGA level 3 data, except for point mutations that were downloaded as level 2 data, and were post-processed as described below. The data was assembled in a mySQL database to enable fast creation of data matrices during simulations. The following subsections contains pseudocode describing the assembling of data, construction of the bootstrapped correlation matrices, and filtering of variables.
• mRNA and miRNA expression data. The level 3 mRNA data provided by TCGA is a list of known protein-coding genes with their corresponding measured mRNA expression values, for each patient.
Data from the Illumina RNA Sequencing platforms were log transformed. Further, all data were quantile normalized within each cancer and platform. MicroCosm Targets Version 5 (11), a summary database of published predicted miRNA targets, was used to map the miRNAs to their target genes.
• DNA sequencing data (non synonymous somatic variants and indels) Each gene was flagged as mutated (1) if it had at least one non-synonymous mutation called by TCGA (level 2 data). Synonymous mutations were not considered. To avoid spurious high correlations between genes with few calls, genes with fewer than 5 patients affected were not included in the analysis.
• DNA copy number aberration data. The level 3 CNA (genetic copy number aberration) information provided by TCGA is represented, for each patient, by the amplitude and genetic positions of beginning and end of DNA segments that have gained or lost copies. Each gene available in NCBI human Build 36.1 was mapped to the segments and assigned the amplitude of the corresponding segment. In the case of multiple segments covering the gene, the average amplitude was used, weighted proportionally to the length of the parts of the segments covering the gene. Genes with a CNA value, but lacking a mRNA measurement were discarded from the analysis.
• DNA Methylation data. Methylation data in TCGA is supplied as a methylation probe with genetic position and a beta value (the ratio of the methylated probe intensity and the sum of methylated and unmethylated probe intensities). Methylation probes with a standard deviation across the patients > 0.05 were kept for further analysis. The predictions of methylation site gene targets provided by TCGA were used. The methylation probe values were replaced by their rank values in the correlation calculations, to overcome the issue of the values not following a normal distribution.

Online application
The Cancer Landscapes (CL) browser has been implemented as a web resource for several reasons, the main one being accessibility. To facilitate the exploration of the models, we make the models and analysis results available through any modern web browser (Chrome, Firefox, Safari, Opera, IE 9+) and an internet connection. Although the system works well in most browsers on a reasonably modern computer, we recommend the use of Google Chrome, because it shows better performance across core technologies.
The Cancer Landscapes browser provides a large number of analysis tools that can be applied to any of the models available in the CL library. These tools can be divided into three categories; basic network exploration, integration of biological data and mathematical methods. We also allow the user to download results of their analysis, or the entire model, for further exploration in other software, such as Cytoscape.

Basic network exploration
A network model is drawn as a number of nodes representing variables and links (edges) between them representing associations. Each node has a specific data type (e.g. gene expression or methylation) and each link can be present in different data classes included in the model, e.g. different types of cancer. The user can choose which data types, which data classes and any combination of these to view. The user can also choose to view links that are present in all of the classes in the model, links unique to each cancer and any combination thereof. There are also multiple options for adjusting the appearance of the network, e.g. colors and link styles. The user can zoom (scroll) and pan (click and drag) around the network to closer inspect regions of interest. By clicking on a node, a window with further information about that node appears. Here the user can view the local properties of that node as well as access further information such as the entry for a gene in the NCBI or OMIM databases.

Links to third-party databases and underlying molecular data
Information about genes in the models, such as clinical results and chromosomal positions, is sourced from the NCBI, OMIM, PubMed, GOslim and PathwayCommons databases. This data is retrieved on-the-fly from each of these sources once the user requests information about a particular gene (by clicking the associated node in the network). This reduces the amount of data transferred and also guarantees that up-todate data is always presented to the user.
The user can highlight genes from any pathway by searching through lists of the available pathways. It is also possible to manually select a region of the network and see which pathways that most overlap with nodes in that region. Such a region may be defined as an area within some distance of a certain point, or as a community within a certain number of steps in the network from a particular node. This allows the user to easily get a feel for how different biological functional categories are spatially distributed, as well as how they relate to other properties of the network structure, such as differences between classes. After performing a clustering of the network, as discussed below, each gene group is associated with the pathway with which it most overlaps. This gives an overview of which functional categories are present as contiguous network structures in different classes of the model. The network models presented in the system builds on a large amount of data from a wide range of diagnoses and data platforms. To fully appreciate the network models it is useful to have access to the underlying data.
In the Cancer Landscapes web browser the user can view measured data for all nodes as well as survival data for patients included in any particular model. This information is presented as scatter plots between the underlying data corresponding to nodes in the network, i.e. the data from which the links are inferred.
These plots illustrate the differential behavior between classes of data and allow the user to get the unfiltered view, not dependent on any network inference.
Kaplan-Meier survival curves with censoring can be viewed for groups of patients corresponding to the classes of the model. Patients can also be stratified based on the upper and lower quartile of data values corresponding to a node (e.g. mRNA expression). The corresponding Kaplan-Meier curves thus capture the difference in survival between groups of patients with differential node data measurements. For nodes corresponding to discrete data, a grouped boxplot is shown to illustrate the difference in values of one node when stratifying based on the discrete values of another node.

Scoring individual nodes by centrality and survival association
We provide the user with multiple measures of node importance to aid in the search for gene targets. Currently, there are three available measures of node importance or node centrality.
• Survival Association. The survival differences between patients stratified by the upper and lower quartile of data for each node (e.g. mRNA expression) is tested via the log-rank test and the corresponding p-value computed. The node-size is then presented as proportional to the −log10(p-value) to visually summarize the network structure in terms of survival association.
• Node Degree. The degree centrality of a node is simply the number of outgoing and incoming edges (links) to that node.
• Betweenness centrality. Betweenness centrality (30) corresponds to how often a node is found on the shortest path between two other nodes. To compute this, we solve a shortest path problem for all pairs of nodes and keep track of how often each node is found on any of the shortest paths. The resulting values are then scaled so that all node centrality values lie between zero and one. The all-pairs shortest path problem is efficiently solved using the algorithm proposed by Brandes (31).
• PageRank. The PageRank measure of node importance is one of the central parts of the Google system for search results ranking (32). This centrality measure corresponds to the average time spent at a certain node when randomly browsing the web where nodes are webpages and links are hyperlinks between pages. The PageRank problem is an eigenvalue problem, which we solve by power iteration. This is done after decomposing the transition probability matrix into a sparse matrix and dense vectors.
This allows the matrix multiplication in the power iteration step to be efficiently performed for sparse networks.

Network layout
The graphical display of the network representing a model is integral to the interpretation of that model.
Network layout procedures come in many flavors and each has their benefits and drawbacks. To make the layout of networks flexible the system provides an initial layout that the user can modify.
The layout algorithm chosen is comprised of three steps. First, the network is split into connected compo- The ForceAtlas2 algorithm is similar to the commonly used force directed algorithm layout proposed by Fruchterman and Reingold (34) in which nodes attract each other with a linear force and repels each other with a force proportional to the square of the distance between them. ForceAtlas2 uses a linear attraction force but a repulsion that depends on the number of links adjacent to a node. The goal with this adjustment is to let poorly connected nodes be placed closer to highly connected ones. ForceAtlas2 also uses an adaptive scheme to adjust the distance a node is allowed to move in each iteration. This has the same effect as the cooling temperature in the force directed algorithm. These parameters can be chosen individually for each component in the system, which provides a great deal of adaptability to the algorithm.

Identification and enrichment analysis of network modules.
Clustering was performed using hierarchical clustering with average linkage. The distance matrix used was

Characterization of network modules.
Enrichment of a certain feature across clusters was calculated for a number of features (e.g. survival associated nodes for each cancer, pathways), by a hypergeometric test (Fisher's exact test, calculated -using MATLAB notation -as 1-hygecdf(x-1,M,K,N), where M is the number of network nodes, K the number of labeled nodes, K the size of the cluster and x is the number of labeled nodes in the cluster). A node was defined as labeled when the corresponding gene is annotated with the feature of interest, e.g. membership in a certain pathway.
The p-values were adjusted using Benjamini-Hochberg correction, for each feature separately and considered significant if they were less than 0.05 after this. To define survival associated nodes for this analysis, we used a Kaplan-Meier log-rank p-value cutoff of 0.05 (this is a deliberately inclusive threshold to avoid very low counts in enrichment testing).
For completeness, a full list of network modules analyzed in the paper is found in a supplementary data file.
In general, however, we recommend using the Cancer Landscapes system for exploring modules.
1.5 Analysis of co-occurrence between IDH1 mutation and chromosome 11p Figure S9: 11p15 deletion co-occurs with IDH1 mutation (a) Oncoprint plot (22) illustrating cooccurrence of IDH1 mutations and homozygous deletions in 14 genes located in the 11p15 region found to be correlated by our model in the eight cancers analyzed. All high-grade glioma patients carrying homozygous deletions in these genes have IDH1 mutations (p<0.00001, see main paper). While the co-occurrence is present in the majority of low-grade glioma patients with both types of mutations (elevated odds ratio, but not significant), it is not observed in the other types of cancers.