Abstract

Recently, extracting inherent biological system information (e.g. cellular networks) from genome-wide expression profiles for developing personalized diagnostic and therapeutic strategies has become increasingly important. However, accurately constructing single-sample networks (SINs) to capture individual characteristics and heterogeneity in disease remains challenging. Here, we propose a sample-specific-weighted correlation network (SWEET) method to model SINs by integrating the genome-wide sample-to-sample correlation (i.e. sample weights) with the differential network between perturbed and aggregate networks. For a group of samples, the genome-wide sample weights can be assessed without prior knowledge of intrinsic subpopulations to address the network edge number bias caused by sample size differences. Compared with the state-of-the-art SIN inference methods, the SWEET SINs in 16 cancers more likely fit the scale-free property, display higher overlap with the human interactomes and perform better in identifying three types of cancer-related genes. Moreover, integrating SWEET SINs with a network proximity measure facilitates characterizing individual features and therapy in diseases, such as somatic mutation, mut-driver and essential genes. Biological experiments further validated two candidate repurposable drugs, albendazole for head and neck squamous cell carcinoma (HNSCC) and lung adenocarcinoma (LUAD) and encorafenib for HNSCC. By applying SWEET, we also identified two possible LUAD subtypes that exhibit distinct clinical features and molecular mechanisms. Overall, the SWEET method complements current SIN inference and analysis methods and presents a view of biological systems at the network level to offer numerous clues for further investigation and clinical translation in network medicine and precision medicine.

Introduction

In most instances, especially for complex diseases, phenotypic changes have been noted as challenging to entirely elucidate by a single molecule (e.g. gene and protein) or pathway [1]. Instead, complex networks are frequently used to characterize perturbations in biological processes since network structures are altered as phenotypic changes occur [2, 3]. However, current biological networks as the background are still incomplete and belong to the aggregated ones derived from experiments on various cell types and states [4, 5], implying that it is difficult to directly use these networks to investigate how structural changes reflect the phenotypic alterations for a specific disease (or cell type/state). To estimate a biological network for a particular phenotype, previous works have proposed that coexpression or gene regulatory networks can be evaluated by correlation coefficients between molecule pairs from sequencing data (e.g. gene expression) across multiple samples [6–8]. Even though the data represent a spectrum of phenotypes, the approaches for inferring these kinds of networks require the consideration of multiple samples, resulting in being regarded as an aggregate network shared across a population of samples rather than single-sample networks (SINs) [6, 9]. One way to achieve personalized medicine is to unravel the molecular mechanisms of diseases for each individual by investigating the patterns of interactions between cellular components and how changes in these network structures contribute to reflecting the homo/heterogeneity in distinct cells, across multiple tissues and between different phenotypes [2, 3]. Nevertheless, accurately constructing SINs to capture both the individual characteristics and complexities of disease is still a critical challenge.

In support of this pursuit, several methods have been developed for exploring sample-level characteristics based on the inference of SINs, such as linear interpolation to obtain network estimates for single samples (LIONESS) [3], sample-specific networks (SSNs) [10, 11] and cell-specific networks (CSNs) [12]. The SSN method use statistical perturbation analysis of one case sample against a group of control samples to construct the SIN for each case sample. Although the control samples could be utilized to filter the noise in estimated coexpression networks, this requirement also restricts their applications, such as cell line-based and single-cell-based analyses. To compensate for this limitation, the CSN method was designed based on a new theoretical model derived from statistical dependency to transform the data from ‘unstable’ gene expression to ‘stable’ gene association. Yet, these methods depend on a differential analysis of the underlying expression data and hence may miss any information shared across the population [3]. To avoid relying on differential analysis, Kuijjer et al. proposed the LIONESS method to reconstruct the aggregate network in a population of case samples and further estimate SINs for each case sample by applying linear interpolation; in other words, it is an approach to reverse engineer SINs from aggregate networks. It has been known for a long time that common diseases are actually comprised of many subgroups or endotypes (i.e. intrinsic subpopulations), characterized by dissimilarities in terms of the manifested phenotypes and often different underlying mechanisms [13]; for example, distinct cancer subtypes in a cohort of patients [14]. However, the sample size differences between intrinsic subpopulations may cause a network size bias in the statistical perturbation model for the SSN method, in the statistical dependency model for the CSN method and in the model of removing a single sample from an aggregate network for the LIONESS method.

To address these issues, we propose the sample-specific-weighted correlation network (SWEET) method that can estimate SINs by considering the (i) intrinsic homo/heterogeneity between subpopulations without prior knowledge (i.e. genome-wide sample weights) and the (ii) statistical perturbation analysis of the perturbed network for each case sample against the aggregate network for a group of case samples. Using the RNA sequencing (RNA-seq) data of 16 cancer types collected from The Cancer Genome Atlas (TCGA) [15], the simulation analysis illustrates that the genome-wide sample weight can effectively reduce the network edge number bias (i.e. network variation) resulting from an imbalance of intrinsic subpopulation sizes (Supplementary Figure S1). In comparison to the other SIN inference methods, the SWEET SINs are more likely to fit the scale-free characteristics, display a higher overlap with the human protein–protein interaction (PPI) networks and have better performance in identifying the cancer driver genes, cancer genes and prognostic genes in 16 cancers. Analyses of the RNA-seq data of 7177 patient samples from TCGA and of 1101 cell lines from the Cancer Cell Line Encyclopedia (CCLE) project [16] further validated the effectiveness of the SWEET method in characterizing individual features, such as somatic mutation genes (SMGs), mut-driver genes (MDGs) and essential genes. The use of the SWEET method in conjunction with network proximity facilitates the identification of repurposable drugs. For example, a half-maximal inhibitory concentration (IC50) value of ≤0.72 μM for the SCC-25 cells treated with encorafenib and albendazole. Moreover, by applying the network degree matrix of SINs, the SWEET method can not only effectively discriminate different cancer types (or cell line types) but also identify two potential subtypes of lung adenocarcinoma (LUAD) with distinguishing clinical features. In summary, the SWEET method offers an approach for building SINs for individual samples from RNA-seq data, addresses several limitations of current SIN inference methods and presents a view of biological network systems to compensate for the current analyses of omics data.

Materials and methods

Gene expression datasets

To construct the SINs for primary tumor samples across 16 cancer types, we first downloaded the RNA-seq profiling data of 6480 tumor samples and 697 corresponding normal samples from TCGA [15] by using the ProcessRNASeqData function of TCGA assembler [17] (Supplementary Table S1). For each gene, the log2 fold change (log2 FC) between the expression value in a specific tumor sample and the average expression value across all normal samples was used to evaluate the differential expression level. To establish the SINs for individual cell lines, we collected the RNA-seq data of 1019 cancer cell lines from the CCLE project (using the CCLE_RNAseq_rsem_genes_tpm_20180929.txt file) [16] via the Broad DepMap Portal. These cell lines can be further classified into 26 types (Supplementary Table S2). The transcripts per million values were log2 transformed before being used for further analysis.

Cancer driver gene, cancer gene and prognostic gene sets

To evaluate the associations between different kinds of cancer-related genes and hub genes in SINs, we collected and curated three types of gene sets, including cancer driver genes, cancer genes and prognostic genes (Supplementary Note 1). The cancer driver gene set collected from the Cancer Genes Census (CGC; downloaded on Jan 2021) [18] and Network of Cancer Genes (NCGs; version 6.0) [19] databases contains 725 pan-cancer driver genes (Supplementary Data 1). To collect the cancer gene set for each cancer, the disease IDs provided by the medical subject headings (MeSH) [20] thesaurus were used to represent corresponding TCGA cancer types. According to the MeSH disease IDs for 16 cancers, we assembled the cancer genes from five databases, including OMIM [21], CTD [22], ClinVar [23], PheGenI [24] and DisGeNET [25] (Supplementary Table S3 and Data 1). Additionally, based on the TCGA patient samples with both gene expression data and clinical outcome data in each cancer type, we estimated the association between each gene and the 10-year overall survival outcomes by Cox proportional hazards regression (details in Supplementary Note 1). A gene with a P-value of 0.05 or less (log-rank test) was deemed to have a significant prognostic association, and genes with hazard ratio (HR) >1 (or z score > 0) and < 1 (or z score < 0) were considered adverse and favorable prognostic genes, respectively [26] (Supplementary Table S4).

SMGs and MDGs for individual patients and cell lines

Somatic mutations occur in any non-germ cell of the body after conception (e.g. those that initiate tumorigenesis), where the mutation that directly or indirectly confers a selective growth advantage to the cell in which it appears is regarded as the driver mutation [27]. Genes with somatic mutations and driver mutations are considered SMGs and MDGs, respectively. To validate whether the SINs for patient samples (or cell lines) are associated with individual features (e.g. SMGs and MDGs), we first downloaded TCGA mutation data obtained from whole-exome sequencing from the cBioPortal database [28]. Additionally, 10 types of mutations (missense, nonsense, nonstop, frameshift insertion or deletion, in-frame insertion or deletion, splice region or site and translation start site) were considered biologically relevant for specific genes and were further used to determine the SMGs for each sample [26]. Since the MDGs for each sample were not provided by TCGA, we defined a potential MDG as one that satisfies two criteria [10]: it is one of the SMGs of each sample and one of the 725 pan-cancer driver genes in humans.

For the 1019 CCLE cancer cell lines, mutation call data (version: public 19Q1) were collected from the DepMap portal. Next, we considered 14 types of mutations (missense, nonsense, nonstop, frameshift insertion or deletion, in-frame insertion or deletion, splice site, stop codon insertion or deletion, start codon insertion, deletion, or single nucleotide polymorphism and de novo start codon, which is out of frame) as biologically relevant for specific genes and then employed them to determine the SMGs for each cell line. Moreover, the driver mutation data for each cell line were assembled from Cell Model Passports (version: 20220315).

Drug–target interaction set and drug sensitivity data

To examine whether SINs could be used for personalized drug screening in cancers, we first collected high-quality physical drug–target interactions on Food and Drug Administration (FDA)-approved or clinically investigational drugs from the DrugBank [29], Binding DB [30] and ChEMBL [31] databases. Then, we defined a physical drug–target interaction using reported binding affinity data and kept only the drug–target interactions that satisfied the criteria suggested by Cheng et al. [32] (details in Supplementary Note 2). Based on drug–disease associations from the Kyoto Encyclopedia of Genes and Genomes (KEGG) [33] and MEDication Indication resource [34] databases, we used the MeSH disease IDs of 16 TCGA cancer types to define which drugs could be indicated for certain cancers and employed the DrugBank ID to screen out 69 unique drugs in 13 cancers, where these drugs are FDA-approved or clinically investigational and associated with at least one of 549 human targets (Supplementary Table S5 and Data 2). Furthermore, we collected the drug sensitivity data of the GDSC1 and GDSC2 datasets from the Genomics of Drug Sensitivity in Cancer (GDSC) [35] database (details in Supplementary Note 2).

CRISPR knockout screening data from project Achilles

To study whether SINs could be used to identify essential genes in human cancer cell lines, we first downloaded the experimental data of the CRISPR knockout screens from Project Achilles [36] (version: DepMap Public 21Q1), containing the results of a genome-scale CRISPR knockout screen (18,119 genes in 808 cell lines). In these profiles, each gene was assigned a CERES score (also called dependency score) to reflect the genetic dependency in each cell line. A CERES score of 0 means a gene is not essential; correspondingly, a CERES score of −1 is comparable to the median of all pan-essential genes i.e. a significant gene-knockout effect in the cell line [36]. Thus, Project Achilles suggested a gene with a CERES score of ≤ −1 as an essential gene in the cell line. The numbers of essential genes defined by different thresholds (i.e. 0, −0.5, −1 and − 1.5) of the CERES scores in 808 cell lines are shown in Supplementary Table S6. The Pearson correlation coefficient (PCC), Spearman’s rank correlation coefficient and overlap coefficient were evaluated based on the CERES score and network degree matrices to examine the relationship between the degree distribution in each SIN and the genetic dependency in the corresponding cell line (details in Supplementary Note 3).

Construction of a SIN

In this study, we developed a method called SWEET to construct a SIN for each sample |${S}_p$| based on the gene expression profiles of a group of n case samples. To neutralize the network edge number bias caused by the intrinsic subpopulations (Supplementary Figure S1), the SWEET method first evaluates the PCC between every two samples based on their expression profiles across m genes to establish a sample-to-sample PCC matrix (Figure 1A). The average PCC (⁠|${\mu}_{PCC}^{\left({S}_p\right)}$|⁠) for each case sample |${S}_p$| is measured and then used to calculate the genome-wide sample weight, |${W}^{\left({S}_p\right)}$|⁠. |${W}^{\left({S}_p\right)}$| is defined as
(1)
where |${\mu}_{PCC}^{\left({S}_p\right)}$| is the mean of PCCs between case sample |${S}_p$| and each of the others, |${\mu}_{PCC}^{(S)}$| is the set of average PCCs for n samples and x is a constant value to avoid the sample weight being equal to 0 (here, x is set to 0.01).
Schematic illustration of SWEET construction. (A) Quantification of sample weights via the genome-wide expression profiles to adjust the sample size differences between intrinsic populations. For each sample ${S}_p$, the SWEET method estimates the PCCs of expression profiles between it and the other samples and transforms them into a genome-wide sample weight, ${W}^{\left({S}_p\right)}$. (B) Construction of the aggregate network for a group of case samples and the perturbed network for each case sample based on gene coexpression. The aggregate network can be established by calculating the PCC between the expression profiles of any two genes i and j (i.e. a gene pair, ${E}_{ij}^{(n)}$) across this group of samples (n samples). Next, a copy of each sample (${S}_p$) is added to the overall samples (n + 1 samples) to calculate the PCC of every gene pair, ${E}_{ij}^{\left(n+{S}_p\right)}$, to further build a perturbed network. (C) Our statistical model for the establishment of SINs by evaluating the difference between the aggregate and perturbed networks with the integration of genome-wide sample weights. The confidence of each edge is quantified by the SWEET score, ${E}_{ij}^{\left({S}_p\right)}$, and the significance level is evaluated by a z-test.
Figure 1

Schematic illustration of SWEET construction. (A) Quantification of sample weights via the genome-wide expression profiles to adjust the sample size differences between intrinsic populations. For each sample |${S}_p$|⁠, the SWEET method estimates the PCCs of expression profiles between it and the other samples and transforms them into a genome-wide sample weight, |${W}^{\left({S}_p\right)}$|⁠. (B) Construction of the aggregate network for a group of case samples and the perturbed network for each case sample based on gene coexpression. The aggregate network can be established by calculating the PCC between the expression profiles of any two genes i and j (i.e. a gene pair, |${E}_{ij}^{(n)}$|⁠) across this group of samples (n samples). Next, a copy of each sample (⁠|${S}_p$|⁠) is added to the overall samples (n + 1 samples) to calculate the PCC of every gene pair, |${E}_{ij}^{\left(n+{S}_p\right)}$|⁠, to further build a perturbed network. (C) Our statistical model for the establishment of SINs by evaluating the difference between the aggregate and perturbed networks with the integration of genome-wide sample weights. The confidence of each edge is quantified by the SWEET score, |${E}_{ij}^{\left({S}_p\right)}$|⁠, and the significance level is evaluated by a z-test.

Next, the PCC as the score of an edge (⁠|${E}_{ij}^{(n)}$|⁠) between genes i and j was measured based on their expression profiles across n case samples to construct the aggregate network (Figure 1B). For the establishment of the perturbed network, we made a copy of the expression profile for case sample |${S}_p$| to those of overall samples and then calculated the PCC of the edge between genes i and j across n + 1 case samples, |${E}_{ij}^{\left(n+{S}_p\right)}$|⁠. Finally, we evaluated the difference between the aggregate and perturbed networks with the integration of genome-wide sample weights to construct n SINs corresponding to n case samples (Figure 1C). The confidence score of an edge (⁠|${E}_{ij}^{\left({S}_p\right)}$|⁠) between genes i and j in a SIN is given as
(2)
where n is the scale factor (i.e. the total number of case samples) for adjusting the differential correlation, |${E}_{ij}^{\left(n+{S}_p\right)}-{E}_{ij}^{(n)}$| (Supplementary Figure S2), and K is the balance parameter (i.e. the percentage of the total number of samples). Based on the scale-free nature of most biological networks [37], systematic parameter variation provided evidence that the average degree exponent (⁠|$\gamma$|⁠) and the average coefficient of determination (⁠|${R}^2$|⁠) of SWEET SINs more frequently achieved the highest median values across the four z score thresholds when the parameter K was set to 10% (Supplementary Figure S3). Accordingly, the parameter K of 10% was used in this study. The larger and smaller |${W}^{\left({S}_p\right)}$| values indicate that sample |${S}_p$| is more similar and different to the other samples, implying that it may belong to a large subpopulation and a small subpopulation, respectively. It is predictable that the difference between the aggregate and perturbed networks for sample |${S}_p$| is relatively small (or large) if the case sample |${S}_p$| is part of a large (or small) subpopulation, and the |${W}^{\left({S}_p\right)}$| value is used to adjust this bias of difference. The significance level, |$z\left({E}_{ij}^{\left({S}_p\right)}\right)$|⁠, of the confidence score for the edge between genes i and j is then measured by a z-test.
(3)
where μ and σ are the mean and standard deviation, respectively, of confidence scores for all edges in the SINs of all samples. If the absolute z score of an edge is larger than the significance level, we will reject the null hypothesis, and two genes have an edge in the SIN of sample |${S}_p$|⁠. Here, the significance level was set to 1.960 or 2.576 (i.e. P-value |$\le$|0.05 or 0.01) for building SINs in the following analyses. The details of the SSN, LIONESS and CSN methods are described in Supplementary Note 4 and Table S7.

Results

Sample-specific-weighted correlation network

Figure 1 depicts the process of using the SWEET method to infer SINs with the confidence score for each edge by integrating the genome-wide sample-to-sample correlation with the differential network between perturbed and aggregate networks. Given gene expression data across multiple samples, the goal of the SWEET method is to model SINs to better estimate network-level similarity and diversity across a population. To examine the influence of sample size differences between subpopulations on constructing SINs using their gene expression data, we used 16 TCGA RNA-seq datasets to create the simulation sets containing large and small subpopulations with different folds between sample sizes (Supplementary Figure S1). Notably, the network variations of LIONESS, SSN and CSN SINs are apparently proportional to the sample size differences, reflecting a bias that these SINs in a larger subpopulation usually have fewer edge numbers. Conversely, we found that the SWEET method using the genome-wide sample weight can effectively avoid the bias caused by sample size differences between subpopulations to reduce the unreasonable network variation.

To further investigate whether sample weights could reflect sample size differences between subpopulations, we used the RNA-seq data of both normal and tumor samples in each of 16 TCGA cancers to measure the sample weights. Among these 16 cancers, each tumor sample size was larger than the corresponding normal sample size, ranging from 2.6-fold (i.e. kidney chromophobe, KICH) to 21.5-fold (i.e. bladder urothelial carcinoma, BLCA; Supplementary Table S1). As expected, the average sample weights for tumor samples were significantly higher than those for normal samples in 16 cancers (P = 0.0002, Wilcoxon signed-rank test; Supplementary Figure S4A). Furthermore, the sample weights not only could reflect sample size differences between subpopulations but were also not, to a certain extent, affected by similar genomes for tumors derived from the same/adjacent organs (details in Supplementary Note 5 and Figure S4B). Additionally, to further examine the influence of sample size differences between the whole sample set and its subsets on sample weight, we conducted a simulation analysis using 1101 tumor samples of BRCA, which has the largest sample size in TCGA (Supplementary Figure S4C). Twenty percent of tumor samples were randomly removed from the whole sample set of BRCA in turn to build four subsets (i.e. 80, 60, 40 and 20% of the total samples). The comparison results suggested the sample weights are generally independent of sample size differences between the whole sample set and its subsets.

Characteristics of SINs constructed by different methods

To quantify whether SINs are likely to exhibit scale-free network characteristics (Supplementary Note 6) [8, 37, 38], we first constructed the SINs by the SWEET, LIONESS, SSN and CSN methods using the TCGA RNA-seq data. Based on 6480 SINs in 16 cancers, the results showed that SWEET SINs displayed significantly higher median values of R2 than LIONESS, SSN and CSN SINs (P < 0.005, Wilcoxon signed-rank test) when the P-value thresholds of the z-test were set to 0.01, 0.005 and 0.001 (Figure 2A). Compared with the SINs generated by the other methods, the median values of the degree exponent (γ) for SWEET and CSN SINs across 16 cancers could achieve at least 1.0, especially with P-values of edges ≤0.005 (Supplementary Figure S5). These SWEET and CSN SINs with γ values ranging between 1 and 2 are consistent with the architecture of previously described biological networks [26, 39–41].

Quantification of SINs constructed by the SWEET, LIONESS, SSN and CSN methods based on TCGA RNA-seq data in 16 cancer types. (A) Distribution of average coefficient of determination (${R}^2$) values (i.e. scale-free topology fit index) in the SINs generated using four P-value thresholds. The average ${R}^2$ value of all SINs in each cancer type was used to represent the corresponding cancer. (B) Distribution of the average overlap coefficients between the edges of the STRING human PPI network and the top 0.1, 0.5, 1 and 5% edges ranked by the corresponding edge scores among the total edges of the fully connected network. In all boxplots, dots represent individual data points for 16 cancer types. The data of the different methods were statistically analyzed using the Wilcoxon signed-rank test. P-values of $<$0.05, $<$0.01 and $<$0.005 are indicated by an asterisk, a double asterisk and a triple asterisk, respectively. (C) Comparison of the average network densities for SINs in 16 cancers. Network density is the number of edges divided by the number of all the possible edges, $m\left(m-1\right)/2$, in each SIN, where m is the number of nodes in each SIN. The average network density for all SINs in each cancer type was used to represent the corresponding cancer. The average network densities for SINs constructed at a P-value threshold of ≤0.05 are labeled.
Figure 2

Quantification of SINs constructed by the SWEET, LIONESS, SSN and CSN methods based on TCGA RNA-seq data in 16 cancer types. (A) Distribution of average coefficient of determination (⁠|${R}^2$|⁠) values (i.e. scale-free topology fit index) in the SINs generated using four P-value thresholds. The average |${R}^2$| value of all SINs in each cancer type was used to represent the corresponding cancer. (B) Distribution of the average overlap coefficients between the edges of the STRING human PPI network and the top 0.1, 0.5, 1 and 5% edges ranked by the corresponding edge scores among the total edges of the fully connected network. In all boxplots, dots represent individual data points for 16 cancer types. The data of the different methods were statistically analyzed using the Wilcoxon signed-rank test. P-values of |$<$|0.05, |$<$|0.01 and |$<$|0.005 are indicated by an asterisk, a double asterisk and a triple asterisk, respectively. (C) Comparison of the average network densities for SINs in 16 cancers. Network density is the number of edges divided by the number of all the possible edges, |$m\left(m-1\right)/2$|⁠, in each SIN, where m is the number of nodes in each SIN. The average network density for all SINs in each cancer type was used to represent the corresponding cancer. The average network densities for SINs constructed at a P-value threshold of ≤0.05 are labeled.

To validate the identified edges of SINs generated by different methods, we first collected two reference networks, including the human PPI network from the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) database [42] and the human interactome assembled from 21 public databases by Gysi et al. [41]. According to the two reference networks, the results demonstrated that the average overlap coefficients of the selected edge sets (i.e. top 0.1, 0.5, 1 and 5% edges) for SWEET SINs were significantly higher than those for LIONESS, SSN and CSN SINs (P < 0.0005, Wilcoxon signed-rank test; Figure 2B and Supplementary Note 6 and Figure S6). Additionally, we observed that the average network densities for SSN (~29.0%) and CSN (~12.0%) SINs in 16 cancers under the same P-value threshold were apparently higher than those for SWEET (~4.2%) and LIONESS (~3.4%) SINs (Figure 2C). In contrast, the network densities for the STRING human PPI network and human interactome were 0.4 and 0.2%, respectively, implying that the edge numbers of SWEET and LIONESS SINs were relatively moderate. Taken together, these results suggest that our SWEET method can identify high-confidence edges and establish SINs with the characteristics of scale-free networks.

Comparisons of different SIN inference methods in identifying three types of cancer-related genes

Although interpatient heterogeneity may result in distinct orders of appearance of cancer driver genes (or other cancer-related genes), these genes are likely to be linked with numerous other genes (i.e. high-degree genes) to comprehensively regulate the pathways associated with tumorigenesis [11]. Based on this assumption, to benchmark the performance of the different SIN inference methods in detecting cancer driver genes, cancer genes and prognostic genes, we implemented these methods to infer the SINs and rank the genes based on their degrees (details in Supplementary Note 7 and Figure S7). Then, we examined the ranks of three types of cancer-related genes in the SINs and plotted their cumulative distributions [43]. Compared with the LIONESS, SSN and CSN methods and the differential expression analysis (DEA) method, our SWEET method achieved the best performance (i.e. the highest AUC) in detecting cancer driver genes (Figure 3A), cancer genes (Figure 3B) and prognostic genes (Figure 3C). To examine whether different network sizes of SINs inferred by these methods may cause bias in performance, we chose the same numbers of edges for LIONESS, SSN or CSN SINs according to the corresponding SWEET ones, in which the edges were ranked by their scores assessed by the respective methods. Similar results were observed, as shown in Supplementary Figure S8, suggesting that these cancer-relevant genes are more likely to be high-degree nodes in SWEET and SSN SINs.

Deviation of the cumulative distribution from uniform (DCDU) of the scaled ranks for benchmark genes based on their node degrees in the SINs generated by the SWEET, LIONESS, SSN and CSN methods at two P-value thresholds and their absolute values of log2 fold change between tumor and normal samples (i.e. DEA method) using the RNA-seq data across 16 cancers. The benchmark gene sets include (A) cancer driver genes, (B) cancer genes and (C) prognostic genes. The black solid lines denote the ranking distribution of randomly ordered benchmark gene sets. (D) The area under the curve (AUC) values as measures of the degree (or absolute value of log2 fold change) of deviation from uniform for three gene sets using different methods.
Figure 3

Deviation of the cumulative distribution from uniform (DCDU) of the scaled ranks for benchmark genes based on their node degrees in the SINs generated by the SWEET, LIONESS, SSN and CSN methods at two P-value thresholds and their absolute values of log2 fold change between tumor and normal samples (i.e. DEA method) using the RNA-seq data across 16 cancers. The benchmark gene sets include (A) cancer driver genes, (B) cancer genes and (C) prognostic genes. The black solid lines denote the ranking distribution of randomly ordered benchmark gene sets. (D) The area under the curve (AUC) values as measures of the degree (or absolute value of log2 fold change) of deviation from uniform for three gene sets using different methods.

To further compare the performance of SIN inference methods with 13 other mutation-based methods [44–56] for detecting cancer driver genes, the F1 score was used to assess their detection abilities based on a gold standard set containing driver genes collected from the CGC [18] and NCG [19] databases. The results demonstrated that the F1 scores of SWEET SINs were significantly higher (P < 0.05, Wilcoxon signed-rank test or Mann–Whitney U test) than those of the other three SIN inference methods and the other 13 mutation-based methods (Supplementary Figure S9). Notably, four SIN inference methods achieved better performance outcomes in identifying driver genes than most mutation-based methods. One of the possible reasons is that these SINs could display the interaction variations involved in the biological processes for different samples.

Characterization of individual features using SINs

Previous studies have suggested that no two tumor genomes carry the same collection of somatic mutations [57]. Thus, SMGs and MDGs in cancers are considered individual-specific features for characterizing tumor homogeneity and heterogeneity between patients. Compared with other genes, mutations of genes encoding hubs or genes near hubs are more likely to result in multiple phenotypic consequences [10]. To validate whether SINs can be used to characterize individual features, we first measured the proximity between the SMGs (or MDGs) and hubs of the corresponding SINs for each tumor sample or cell line (details in Supplementary Note 8). For tumor samples in 16 cancers, the results demonstrated that the average shortest path lengths (SPLs) between the SMGs and the hubs of SWEET SINs were significantly shorter (P < 0.01, Wilcoxon signed-rank test) than those for LIONESS, SSN and CSN SINs, irrespective of the hub definition (Supplementary Figure S10A). Compared with 100 random trials, SSN SINs also displayed more negative z scores for average SPLs, outperforming those for the LIONESS, SSN and CSN methods (P < 0.05; Figure 4A and Supplementary Figure S10B). Furthermore, the MDGs were more likely to be located near the hubs of SWEET SINs than the SINs derived from other methods (Figure 4B and Supplementary Figure S10C and D). Similar results for proximity measures between the SMGs (or MDGs) and SIN hubs using the RNA-seq data of 1019 CCLE cell lines were observed, as shown in Figure 4C and D and Supplementary Figure S11. Note that the RNA-seq data of 1019 CCLE cell lines (i.e. whole sample set) were used as the reference set to construct the SSN SINs [58] since the CCLE cell line data lack the corresponding normal (or control) samples. We also observed that MDGs tended to be closer to the SIN hubs than SMGs (Supplementary Figures S10A and C and S11A and C). Moreover, the percentages of MDGs gradually increase as their node degrees increase regardless of the SIN inference method or expression data source (i.e. TCGA or CCLE; Figure 4E and Supplementary Figure S12). These results suggest that SINs can be utilized to infer which genes in a specific sample are more likely to exhibit driver mutations even without DNA sequencing information.

Quantification of SIN inference methods in characterizing individual features. Comparisons of z scores for the average shortest path lengths (SPLs) between (A) the SMGs and SIN hubs, and (B) the MDGs and SIN hubs on the STRING human PPI network. The SINs were constructed by the SWEET, LIONESS, SSN and CSN methods at two P-value thresholds by using TCGA RNA-seq data in 16 cancer types, and the genes with degrees within the top 25% of all genes were deemed the hubs of each SIN. The z scores were calculated to quantify the significance of the average SPLs between the hubs and SMGs (or MDGs) on the STRING network. The mean z score for all SINs in each cancer type was used to represent the corresponding cancer. In all boxplots, dots represent individual data points for 16 cancer types. Violin plots of z scores for the average SPLs between (C) the SMGs and SIN hubs, and (D) the MDGs and SIN hubs on the STRING human PPI network. The SINs were built by the SWEET, LIONESS, SSN and CSN methods across two P-value thresholds using the RNA-seq data of 1019 cell lines from CCLE. Note that the RNA-seq data of 1019 CCLE cell lines were also used as the reference set to construct the SSN SINs since the CCLE cell line data lack the corresponding normal (or control) samples. P-values of $<$0.05, $<$0.01 and $<$0.005 (Wilcoxon signed-rank test) are indicated by an asterisk, a double asterisk and a triple asterisk, respectively. (E) The percentage of MDGs among SMGs in the top 1000-, 500-, 100-, 30-, 20- and 10-highest-degree genes of SINs constructed using TCGA RNA-seq data in 16 cancer types at a P-value threshold of ≤0.05. (F) Kernel density plot representation of overlap coefficients between the hub genes (i.e. top 25%) of SINs and the genes with a significant gene-knockout effect (i.e. CERES scores ≤ −1) in 580 cell lines.
Figure 4

Quantification of SIN inference methods in characterizing individual features. Comparisons of z scores for the average shortest path lengths (SPLs) between (A) the SMGs and SIN hubs, and (B) the MDGs and SIN hubs on the STRING human PPI network. The SINs were constructed by the SWEET, LIONESS, SSN and CSN methods at two P-value thresholds by using TCGA RNA-seq data in 16 cancer types, and the genes with degrees within the top 25% of all genes were deemed the hubs of each SIN. The z scores were calculated to quantify the significance of the average SPLs between the hubs and SMGs (or MDGs) on the STRING network. The mean z score for all SINs in each cancer type was used to represent the corresponding cancer. In all boxplots, dots represent individual data points for 16 cancer types. Violin plots of z scores for the average SPLs between (C) the SMGs and SIN hubs, and (D) the MDGs and SIN hubs on the STRING human PPI network. The SINs were built by the SWEET, LIONESS, SSN and CSN methods across two P-value thresholds using the RNA-seq data of 1019 cell lines from CCLE. Note that the RNA-seq data of 1019 CCLE cell lines were also used as the reference set to construct the SSN SINs since the CCLE cell line data lack the corresponding normal (or control) samples. P-values of |$<$|0.05, |$<$|0.01 and |$<$|0.005 (Wilcoxon signed-rank test) are indicated by an asterisk, a double asterisk and a triple asterisk, respectively. (E) The percentage of MDGs among SMGs in the top 1000-, 500-, 100-, 30-, 20- and 10-highest-degree genes of SINs constructed using TCGA RNA-seq data in 16 cancer types at a P-value threshold of ≤0.05. (F) Kernel density plot representation of overlap coefficients between the hub genes (i.e. top 25%) of SINs and the genes with a significant gene-knockout effect (i.e. CERES scores ≤ −1) in 580 cell lines.

To further examine whether high-degree genes in SINs tend to be essential in each cancer cell line, we assembled the CRISPR knockout screening data from Project Achilles [36] to produce a CERES score matrix. The more negative the CERES score of a gene in a cell line is, the higher its gene essentiality. Previous studies have demonstrated a positive correlation between essentiality and degree in the biological networks of several organisms (e.g. yeast, worm, fly, mouse and human) [59–61]. Therefore, we first wanted to observe whether the degree distributions of inferred SINs tend to have a negative correlation with the corresponding distributions of CERES scores. Among the 580 cell lines, the network degree matrices of SWEET, LIONESS, SSN and CSN SINs were significantly negatively correlated with the CERES score matrix than was expected by 100 random trials (P < |$6\times{10}^{-22}$|⁠, Mann–Whitney U Test; Supplementary Figure S13A and B). We also found that the network degrees of 17,916 genes in SWEET SINs showed a significantly higher negative correlation with their CERES scores than those in LIONESS, SSN and CSN SINs (P < |$2\times{10}^{-256}$|⁠, Wilcoxon signed-rank test; Supplementary Figure S13C and D); in other words, the essentiality and degree in SWEET SINs for 580 cell lines show a higher tendency to be positively correlated. Next, we explored whether the hubs of a SIN are more likely to be essential in each cell line, where each gene with CERES scores ≤0, −0.5, −1 (suggested threshold in Project Achilles) or − 1.5 is respectively considered an essential gene, which has a significant gene-knockout effect. The results showed that the overlap coefficients between the hubs of SWEET SINs and the essential genes in corresponding cell lines were significantly higher than those for the LIONESS, SSN and CSN methods, irrespective of the hub definition, essential gene definition or significance level of the edge score (P < |$8\times{10}^{-13}$|⁠, Wilcoxon signed-rank test; Figure 4F and Supplementary Figure S14A). Moreover, the hubs of SWEET, LIONESS and SSN SINs were significantly more likely to be the essential genes than those of random chance (P < |$3\times{10}^{-34}$|⁠, Mann–Whitney U Test; Supplementary Figure S14B and C), except for the CSN SINs. In summary, for characterizing individual features, our SWEET provides a new approach that enables the identification of potential genes with somatic mutations and driver mutations as well as essential genes.

SIN-based proximity measure for predicting drug–cancer relationships and drug efficacy

Because identifying individual patient-specific cancer-relevant genes remains a challenge, we hypothesized that the hubs of a SIN are closer to the cancer-relevant genes than the other genes; thus, a drug with targets localized near these hubs is more likely to be effective for the corresponding tumor sample or cell line (details in Supplementary Note 8). The results showed that the SWEET SINs displayed significantly more negative z scores for average SPLs between hubs and corresponding drug targets than those of the LIONESS, SSN and CSN methods, irrespective of binding affinity criterion for defining drug–target interactions (P < 0.05, Wilcoxon signed-rank test; Figure 5A and Supplementary Figures S15 and S16), implying that the drug targets are more likely to neighbor on the hubs of SWEET SINs.

Evaluation of SIN-based proximity for detecting drug–cancer relationships and predicting drug sensitivity in cancer cell lines. (A) Comparisons of z scores for the average SPLs between the SIN hubs and drug targets on the STRING human PPI network. The SINs were constructed by the SWEET, LIONESS, SSN and CSN methods across two P value thresholds using RNA-seq data in 13 cancer types. Note that only RNA-seq data in 13 cancer types were used in this analysis because 69 drugs are FDA-approved or clinically investigational for these cancers and associated with at least one human target (for details, see Supplementary Table S5 and Data 2). Here, we considered the genes with degrees within the top 25% of all genes as the hubs of each SIN. The mean z score for all SINs in each cancer type was used to represent the corresponding cancer. In all boxplots, dots represent individual data points for 13 cancer types. An asterisk indicates a P value of $<$0.05 (Wilcoxon signed-rank test). (B) Distribution of the average SPLs between the SIN hubs and drug targets on the STRING human PPI network under different half-maximal inhibitory concentration (IC50) values provided by the GDSC database based on the cell viability assays. The SINs were generated by four SIN inference methods using the RNA-seq data of 636 cell lines from CCLE. The error bars indicate the mean ± standard deviation. (C) Violin plots of z scores for the average SPLs between the drug targets and SIN hubs for 636 cell lines, of which only drugs with an IC50 value of ≤10 $\mu \mathrm{M}$ for the corresponding cell lines were used in this analysis. The percentage of effective drug–cell line pairs (i.e. IC50 value of ≤10 $\mu \mathrm{M}$) with significant proximity (i.e. z score of $\le$-1.96) is shown on the bottom right corner of each plot. A triple asterisk indicates a P value of $<$0.005 (Wilcoxon signed-rank test). (D) Drug-induced cytotoxicity of IC50 values determined by the MTT assay on 96-well plates. SCC-25 and A549 cancer cells were treated with candidate repurposable drugs (encorafenib, albendazole, bleomycin, etoposide, vincristine, panobinostat and belinostat) at various concentrations (0.005, 0.01, 0.05, 0.1, 0.5, 1, 5 and/or 10 μM) for 48 h. The concentrations are shown on a logarithmic (log) scale of nM (x-axis). The data are representative of at least three independent experiments. The error bars indicate the mean ± standard deviation. The IC50 values of these seven drugs from this study are shown.
Figure 5

Evaluation of SIN-based proximity for detecting drug–cancer relationships and predicting drug sensitivity in cancer cell lines. (A) Comparisons of z scores for the average SPLs between the SIN hubs and drug targets on the STRING human PPI network. The SINs were constructed by the SWEET, LIONESS, SSN and CSN methods across two P value thresholds using RNA-seq data in 13 cancer types. Note that only RNA-seq data in 13 cancer types were used in this analysis because 69 drugs are FDA-approved or clinically investigational for these cancers and associated with at least one human target (for details, see Supplementary Table S5 and Data 2). Here, we considered the genes with degrees within the top 25% of all genes as the hubs of each SIN. The mean z score for all SINs in each cancer type was used to represent the corresponding cancer. In all boxplots, dots represent individual data points for 13 cancer types. An asterisk indicates a P value of |$<$|0.05 (Wilcoxon signed-rank test). (B) Distribution of the average SPLs between the SIN hubs and drug targets on the STRING human PPI network under different half-maximal inhibitory concentration (IC50) values provided by the GDSC database based on the cell viability assays. The SINs were generated by four SIN inference methods using the RNA-seq data of 636 cell lines from CCLE. The error bars indicate the mean ± standard deviation. (C) Violin plots of z scores for the average SPLs between the drug targets and SIN hubs for 636 cell lines, of which only drugs with an IC50 value of ≤10 |$\mu \mathrm{M}$| for the corresponding cell lines were used in this analysis. The percentage of effective drug–cell line pairs (i.e. IC50 value of ≤10 |$\mu \mathrm{M}$|⁠) with significant proximity (i.e. z score of |$\le$|-1.96) is shown on the bottom right corner of each plot. A triple asterisk indicates a P value of |$<$|0.005 (Wilcoxon signed-rank test). (D) Drug-induced cytotoxicity of IC50 values determined by the MTT assay on 96-well plates. SCC-25 and A549 cancer cells were treated with candidate repurposable drugs (encorafenib, albendazole, bleomycin, etoposide, vincristine, panobinostat and belinostat) at various concentrations (0.005, 0.01, 0.05, 0.1, 0.5, 1, 5 and/or 10 μM) for 48 h. The concentrations are shown on a logarithmic (log) scale of nM (x-axis). The data are representative of at least three independent experiments. The error bars indicate the mean ± standard deviation. The IC50 values of these seven drugs from this study are shown.

It is difficult to test the drug efficacy in individual cancer patients directly. Therefore, to further validate whether SINs could be used for drug efficacy screening, we built SINs for 636 cell lines and found that the average SPLs between SIN hubs and drug targets begin to increase when a drug has an IC50 of >100 μM for a cell line regardless of the SIN inference method or hub definition (Figure 5B and Supplementary Figure S17A). Next, we focused on the drugs with an IC50 value of ≤10 μM for the corresponding cell lines and found that the SWEET SINs displayed significantly more negative z scores for average SPLs between the hubs and these drug targets than the LIONESS, SSN and CSN SINs (P < 0.005, Wilcoxon signed-rank test; Figure 5C and Supplementary Figure S17B). Among the drug–cell line pairs with IC50 values of ≤10 μM, ≥92, ≤61, ≤63 and ≤ 63% of drug–cell line pairs had significant proximity (i.e. z scores of ≤ − 1.96) in the SWEET, LIONESS, SSN and CSN SINs, respectively. Moreover, compared with LIONESS, SSN and CSN SINs, SWEET SINs achieved better F1 scores, regardless of the proximity measure (i.e. average SPL or z score), the significance level of the edge score or the hub definition (Supplementary Figure S18).

To examine whether SWEET SINs can be applied to detect repurposable drugs for both HNSCC (SCC-25 cell line) and LUAD (A549 cell line), we chose seven candidates (encorafenib, bleomycin, vincristine, etoposide, albendazole, panobinostat and belinostat) according to the network proximity and overlap ratio between SIN hubs and drug targets (details in Supplementary Note 9 and Figure S19). Based on the colorimetric determination of the 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyltetrazolium (MTT) assay, the cell viability after drug treatments in HNSCC (Figure 5D, upper panel) and LUAD (Figure 5D, middle panel) was evaluated (details in Supplementary Note 10). The IC50 values showed that almost all seven drug candidates obtained effective IC50 values in SCC-25 and A549 cells (≤3.07 μM; 11/14 = 79%), whereas only encorafenib, bleomycin and etoposide had IC50 values of >10 μM in A549 cells (Figure 5D, lower panel). Furthermore, among eight drug–cell line pairs without sensitivity data (n/a) in the GDSC database, we experimentally confirmed that six pairs (75%) displayed IC50 values of <10 μM. The IC50 values of albendazole-treated SCC-25 and A549 cells were determined to be 0.53 and 2.26 μM, respectively, in this study, showing a concordance of cytotoxicity with previous investigations [62, 63]. Additionally, encorafenib, a kinase inhibitor, is used for melanoma treatment [64] and we found that encorafenib exhibited high cytotoxicity with an IC50 of 0.72 μM in SCC-25 cells, suggesting its strong anticancer potential in HNSCC.

Network-based sample clustering for detecting cancer types/subtypes

In recent years, the identification of cancer types/subtypes has become a clear and unmet need to advance the precision diagnosis and therapy of cancers [65]. To evaluate whether SINs could identify the cancer types and subtypes, we first generated the network degree matrix of SINs for seven distinct TCGA cancer types derived from the same/adjacent organs (details in Supplementary Note 11). The best results under different edge thresholds (P-values), dimension reduction methods [principal component analysis [66] and/or t-distributed stochastic neighbor embedding (t-SNE) [67]], clustering methods (hierarchical clustering and k-means clustering) and performance measures (F1 score and adjusted rand index) were obtained with SWEET SINs (Supplementary Tables S8 and S9). Similar results for distinguishing 26 types among 1019 CCLE cell lines were observed.

In the visualization with dimension reduction, our results showed that the SWEET method could clearly identify different cancer types, except for colon adenocarcinoma (COAD) and rectum adenocarcinoma (READ) (Figure 6A and Supplementary Figure S20). Because the distinction between the colon and the rectum is largely anatomical [68], COAD and READ have highly similar genomic data and are frequently investigated together [28]. Moreover, we observed that ~96% (494/517) of LUAD samples were grouped into Clusters 5 and 7. Thus, we eliminated the samples of other cancer types from these two clusters and further divided the LUAD samples into two parts, Subtypes 1 (203 samples) and 2 (291 samples; Figure 6A). We found important associations between the two predicted subtypes and clinical variables (Supplementary Figure S21 and Table S10). Tumors in Subtype 1 were frequently diagnosed in male smokers below 65 years of age with lesions on the left (or right) upper sides and presented in nearby lymph nodes or spread to other parts of the body. Conversely, tumors in Subtype 2 were mainly diagnosed in female nonsmokers aged 65 years or above and tended to not be present in nearby lymph nodes (N0) and not have spread to other parts of the body (M0). Although our results suggest that SINs could offer clues for observing putative subtypes in LUAD, these two predicted subtypes, related clinical variables and detailed molecular mechanisms still need to be experimentally validated and possibly refined in the future to avoid that they are driven by confounding variables.

The analysis of the dark gene TBC1D8 (TBC1 domain family member 8). (A) t-SNE plots of degree matrices for the SWEET SINs of 2289 samples across seven cancers, colored by cancer type (left panel) and predicted cluster (right panel). The clustering analysis was performed using hierarchical clustering. The LUAD samples could be divided into two subtypes (Subtypes 1 and 2) based on the degree matrix generated by our SWEET method. (B) The gene expression level (top) and network degree level (bottom) of TBC1D8. (C) Kaplan–Meier survival plots of differences in 10-year overall survival for LUAD patients stratified by the median value (50%) into groups with high or low expression values (top) and high or low degree values (bottom) of TBC1D8. The 95% confidence intervals are displayed in parentheses. HR, hazard ratio. Curve separation was assessed by a log-rank test.
Figure 6

The analysis of the dark gene TBC1D8 (TBC1 domain family member 8). (A) t-SNE plots of degree matrices for the SWEET SINs of 2289 samples across seven cancers, colored by cancer type (left panel) and predicted cluster (right panel). The clustering analysis was performed using hierarchical clustering. The LUAD samples could be divided into two subtypes (Subtypes 1 and 2) based on the degree matrix generated by our SWEET method. (B) The gene expression level (top) and network degree level (bottom) of TBC1D8. (C) Kaplan–Meier survival plots of differences in 10-year overall survival for LUAD patients stratified by the median value (50%) into groups with high or low expression values (top) and high or low degree values (bottom) of TBC1D8. The 95% confidence intervals are displayed in parentheses. HR, hazard ratio. Curve separation was assessed by a log-rank test.

Originally, the non-coding and unannotated DNA sequences and the genes without known function were called ‘dark’ matters [69] and ‘dark’ genes [70], respectively. Recently, the concept of ‘dark’ ones has been extended to describe a gene (or sequence region) with a specific alteration that may be relevant to disease but cannot be detected by typical methods or technologies. For example, a number of sequence regions with mutations, called ‘dark’ regions, cannot be sufficiently assembled or aligned using standard short-read sequencing technologies [71]. Some genes without differential expression are named ‘dark’ genes, which are often overlooked in the traditional differential analyses but may play important roles in disease since they may reflect variation regarding regulations, interactions or even networks [10, 12]. Here, to further investigate whether some genes can be signatures to separate these two subtypes based on network characteristics, we then identified a number of ‘dark’ genes that have a significant difference at the network degree level between samples of the two subtypes, not at the gene expression level (details in Supplementary Note 12). Supplementary Figure S22 illustrates the top 25 dark genes ranked by adjusted P-values of differential degree. It is clear that the TBC1D8 (encoding TBC1 domain family member 8; Figure 6B) and H2AFY2 (encoding core histone macro-H2A.2; Supplementary Figure S23A) genes display higher degrees in Subtype 1 than in Subtype 2; however, there were no significant differences for these two genes between the two subtypes at the gene expression level.

Additionally, compared with the gene expression of TBC1D8 not being significantly associated with a survival outcome in LUAD (Figure 6C, upper panel), there is added value in taking into account the network degrees of TBC1D8 in SINs, thus achieving higher prognostic power for favorable outcomes (Figure 6C, lower panel). To further investigate the mechanisms of the TBC1D8 gene in the two predicted LUAD subtypes in individual patients, we implemented the random walk with restart (RWR) algorithm in SWEET SINs to extract subnetworks centered on TBC1D8 (Supplementary Figure S24). The pathway enrichment analysis revealed that genes ‘downregulated’ in the tumor sample of patient TCGA-55-6972-01A-11R-1949-07 (LUAD subtype 1) compared with the normal samples were significantly associated with metastasis-relevant pathways, such as focal adhesion and proteoglycans in cancer (Figure 7, left panel). Conversely, according to subnetworks for the total of 59 normal samples, the pathway of focal adhesion was only overrepresented (i.e. enriched) in the ‘upregulated’ genes for two normal samples (versus all tumor samples i.e. opposite enrichment pattern), and the pathways of proteoglycans in cancer and Fc gamma R-mediated phagocytosis were not overrepresented in the gene groups of any normal samples (Supplementary Figure S25). This patient had been alive for >4 years on the last follow-up date. On the other hand, the pathway of small cell lung cancer was overrepresented in the ‘upregulated’ genes for another patient, TCGA-55-6979-01A-11R-1949-07 (LUAD subtype 2; Figure 7, right panel), but only overrepresented in the ‘downregulated’ genes for two normal samples (versus all tumor samples i.e. opposite enrichment pattern; Supplementary Figure S25). According to TCGA clinical information, this patient had passed away, and the number of days to death since the initial diagnosis was 237 (<1 year). Previous work has indicated that TBC1D8 can drive ovarian cancer tumorigenesis and metabolic reprogramming [72]. An eight-gene predictor set, including TBC1D8, has been developed to distinguish pancreatic cancer patients from healthy controls [73]. These findings imply that the network degrees of the TBC1D8 gene are more likely to be associated with survival outcomes than its gene expression in lung cancer patients, a possible reason for why it was neglected in previous lung cancer studies. Similarly, the network degrees of the H2AFY2 gene were significantly associated with adverse outcomes in LUAD, but its gene expression values were not (details in Supplementary Note 13, Figures S24, S26 and S27). In short, these results suggest the unique potential of the SWEET SINs in detecting cancer types/subtypes and in the prognostic analysis or prediction of LUAD patients.

Individual-specific subnetworks related to the TBC1D8 gene in the two predicted subtypes of LUAD. The subnetworks centered on the TBC1D8 gene were derived from the SWEET SINs for the patients TCGA-55-6972-01A-11R-1949-07 in LUAD Subtype 1 and TCGA-55-6979-01A-11R-1949-07 in LUAD Subtype 2 using the RWR algorithm (for details, see Supplementary Figure S24A). To provide a concise representation, the network was trimmed by keeping at most the top 25 edges for each node ranked by edge scores, except for the TBC1D8 gene. For each gene, the log2 fold change (log2 FC) of genes between their expression values in a specific tumor sample and the average expression value across all normal samples are displayed in a color scheme (red: upregulated; blue: downregulated; white: no change). The gray solid circles and dashed lines indicate no corresponding nodes and edges in the subnetwork, respectively. The enriched KEGG pathways (hypergeometric test, FDR q-value $\le$0.05) are shown for upregulated/downregulated genes in the distinct subnetworks.
Figure 7

Individual-specific subnetworks related to the TBC1D8 gene in the two predicted subtypes of LUAD. The subnetworks centered on the TBC1D8 gene were derived from the SWEET SINs for the patients TCGA-55-6972-01A-11R-1949-07 in LUAD Subtype 1 and TCGA-55-6979-01A-11R-1949-07 in LUAD Subtype 2 using the RWR algorithm (for details, see Supplementary Figure S24A). To provide a concise representation, the network was trimmed by keeping at most the top 25 edges for each node ranked by edge scores, except for the TBC1D8 gene. For each gene, the log2 fold change (log2 FC) of genes between their expression values in a specific tumor sample and the average expression value across all normal samples are displayed in a color scheme (red: upregulated; blue: downregulated; white: no change). The gray solid circles and dashed lines indicate no corresponding nodes and edges in the subnetwork, respectively. The enriched KEGG pathways (hypergeometric test, FDR q-value |$\le$|0.05) are shown for upregulated/downregulated genes in the distinct subnetworks.

Discussion and conclusion

Extracting inherent biological system information (e.g. cellular networks) from genome-wide expression profiles is an emergent need for developing personalized diagnostic and therapeutic strategies but remains a major challenge. SWEET has unique advantages over related methods and tools [3, 10–12] (details in Supplementary Note 14 and Table S7). First, in contrast to the LIONESS, SSN and CSN methods, our inferred SINs indeed effectively reduce the abnormal network variation. Second, the SWEET method was developed to construct a close-to-real network (e.g. with a scale-free topology) on a single-sample basis from RNA-seq data, which means one network for one sample, including both sample-specific and common edges. Third, by assessing the network degree for each gene in SINs, our SWEET method had better predictive abilities for different types of cancer-related genes, outperforming the other SIN inference methods and the DEA method. Although extending the analysis to structural network control (SNC) methods is beyond the scope of this paper, it is predictable that these SINs could be further applied to SNC methods [11, 74–78] to improve the driver gene detection ability. Fourth, multiple RNA-seq datasets of 7177 patient samples and 1019 cell lines across human cancers were included to establish the SINs and validate the accuracy, robustness and applicability of the SWEET method. Fifth, integrating SWEET SINs with a network proximity measure facilitates individual characterization of disease for developing personalized medicine, such as albendazole and encorafenib as two potential antitumor agents for HNSCC and LUAD; however, the detailed mechanisms remain to be fully elucidated. Finally, the SWEET method has utility in identifying cancer types/subtypes from 2289 tumor samples and 1019 cell lines and is superior to those of the other methods, also suggesting two possible subtypes of LUAD that exhibit distinct clinical features (e.g. age, gender, smoking history and TNM staging). According to these two subtypes, we also found some ‘dark’ genes with a differential network degree but no differential gene expression between samples of the two subtypes, such as the TBC1D8 and H2AFY2 genes. Using the SWEET SINs in conjunction with overall survival data, we further suggest that the TBC1D8 and H2AFY2 genes can be considered favorable and adverse prognostic genes in LUAD, respectively. Indeed, these potential prognostic genes and their detailed molecular mechanisms still require future investigation and experimental validation. There are several possible biological significance behind these ‘dark’ genes. For example, sometimes alterations (e.g. mutations) in two or more genes produce a phenotype that differs from the phenotype expected when the genes were independent of each other. This phenomenon, called genetic interaction, can uncover functional relationships between genes and pathways [79]. Additionally, missense mutations in structural genes have been indicated to affect protein stability and/or interfere with protein–protein binding affinity [80]. Post-translational modifications (PTMs) have been shown to modulate PPIs for performing their function [81]. Therefore, specific alterations (e.g. mutations and PTMs) may cause the genes/gene products to have a differential change in terms of interactions and regulations at the network level but not at the gene expression level; still, these remain to be further validated in the future.

SWEET has several limitations, challenges and perspectives (details in Supplementary Note 15). First, our SWEET method constructed the SINs using a correlation measure, of which the inferred gene associations comprise direct and indirect associations. Hence, in the future, we aim to determine causal relationships (i.e. direct or indirect) and even directions between two genes in SINs. For example, the partial correlation-based single-sample network (P-SSN) method [82] uses a partial correlation to infer direct associations by excluding indirect ones to build the SINs. In the analysis of scale-free network characteristics, the results showed that P-SSN SINs displayed significantly higher median values of R2 than the other SINs (P < 0.05, Wilcoxon signed-rank test; Supplementary Figure S28A). Compared with the SINs generated by the other methods, the median values of the γ for P-SSN and CSN SINs across 16 cancers could achieve at least 1.0 when the P-value thresholds of the z-test were set to 0.01 and 0.05 (Supplementary Figure S28B). We further observed that the average network densities for P-SSN (~1.2%), SWEET (~4.2%) and LIONESS (~3.4%) SINs in 16 cancers were relatively moderate and close to those for the STRING human PPI network (0.4%) and human interactome (0.2%; Supplementary Figure S28C). Indeed, these results suggest that the partial correlation can exclude some indirect associations to establish SINs with the characteristics of scale-free networks. Additionally, similar to the comparison result provided in the reference [82], we also found that the P-SSN method performed better in MDG prediction than the SSN method. Generally, the P-SSN method achieved the second-best or third-best performance in characterizing individual features [e.g. SMGs (Supplementary Figure S29A) and MDGs (Supplementary Figure S29B and C)], detecting cancer driver genes (Supplementary Figure S29D), and predicting drug–cancer relationships (Supplementary Figure S30). Taken together, the P-SSN method is suggested to be used if the researchers would like to exclude indirect associations and focus on studying the correlation/association change among genes of a single sample against the reference network. In contrast, our SWEET method aims to infer each close-to-real SIN, including both sample-specific and common edges, from a group of samples. In the future, it can be expected to build SINs closer to real by integrating the SWEET and P-SSN methods to address the network variation resulting from an imbalance of intrinsic subpopulation sizes and exclude indirect associations simultaneously.

Second, one of the aims of this study was to neutralize sample size bias without prior knowledge of subpopulations. If the subpopulations were predefined and labeled, it is predictable that using the normalized sample weight measured by the corresponding models (e.g. linear, nonlinear or bimodal) could make the inferred SINs closer to real. Finally, similar to other SIN inference methods [10–12], the SWEET method also requires a certain number of samples to achieve a sufficient statistical basis due to the need for correlation calculation. For example, we found that the performance of the SIN-based proximity measure in identifying SMGs, MDGs and drug–cancer relationships using the SWEET, SSN and CSN methods on the RNA-seq data of KICH (65 tumor samples) was worse than that of other cancer types (≥94 tumor samples). To test the stability of the SWEET method, we conducted a simulation analysis using RNA-seq data of BRCA in TCGA. First, 20, 40, 60 and 80% of samples were randomly selected from the whole sample set of BRCA to build four subsets, where the case samples in each subset were used as the observed ones. Based on the whole sample set and four subsets, we found that the average overlap ratios of edges in the SINs are 79.2, 87.4, 91.7 and 94.9% when using 20, 40, 60 and 80% of all samples as the reference set, respectively (Supplementary Table S11). According to 100 random trials, the standard deviation of these average overlap ratios ranged from 0.6 to 1.2%. These results suggest that the different reference sample sizes have a limited effect on the measure of SWEET SINs, implying the SWEET method is robust and stable. Additionally, we observed that the average overlap ratios for the comparison group between the whole sample set and the 20% subset are slightly lower than those for the other comparison groups. One of the possible reasons is that the sample size distribution between subpopulations in the reference set may have been changed when 80% of the samples were removed; for example, the sample sizes of the small subpopulations may have become too small or zero. This is reminiscent of the concordance in the findings that sample-specific differential networks tend to have the same structure if the reference sample size is sufficiently large and the reference sample sets follow the same distribution [58]. The recent rapid advances in sequencing technologies will enable the collection of a substantial amount of expression data for a large number of samples/cells [83, 84], which could be expected to overcome this limitation. By applying the network degree matrix of SINs in identifying cell types on four single-cell RNA sequencing (scRNA-seq) datasets [85, 86], our SWEET method generally achieved the best (bold italic letters) or the second-best (bold letters) clustering performances (Supplementary Note 16 and Tables S12 and S13). Although the performances still need to be further improved in the future, these results suggest that our SWEET method could be applied to distinguishing cell types on scRNA-seq data, which are generally noisier than RNA-seq of solid tumor biopsies.

In conclusion, our results shed light on the influence of subpopulation sample sizes on the construction of SINs and propose a solution to address this problem. Our strategy also promotes the identification of vital genes in individual sample networks as promising targets for the development of biomarkers and therapeutic targets. To our knowledge, our SWEET method provides a useful framework to facilitate the discovery of personalized characteristics, disease subtypes, individual therapies and further clinical applications.

Key Points
  • SWEET is a new computational method focusing on inferring SINs with a scale-free topology for every sample from gene expression data.

  • SWEET method systematically assesses the genome-wide sample-to-sample correlation to address the network edge number bias caused by sample size differences between intrinsic subpopulations in a group of samples.

  • Multiple RNA-seq datasets of 7177 patient samples and 1019 cell lines across human cancers were included to validate that SWEET achieved better performance for the discovery of the individual features and therapy in cancers than current SIN inference methods.

Acknowledgement

We thank Bo-Tai Liao for fruitful discussions about the influence of sample size differences on sample weight.

Funding

Young Scholar Fellowship Program by the Ministry of Science and Technology (MOST) in Taiwan, under Grant MOST 111–2636-B-A49–010 and MOST 111–2628-B-038-002; and the Center for Intelligent Drug Systems and Smart Bio-devices (IDS2B) of the Higher Education Sprout Project (HESP) by the Ministry of Education (MOE), Taiwan. Funding for open access charge: Ministry of Science and Technology, Taiwan.

Code and data availability

Codes for SWEET are freely available in the GitHub repository (https://github.com/SysMednet/SWEET). Data are incorporated into the article and its online supplementary material.

Hsin-Hua Chen received his MS degree in Bioinformatics from the Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Taiwan, in 2021. He is currently working on experimental data digitalization and automated analysis development. His research interests are bioinformatics, gene expression analysis and network inference.

Chun-Wei Hsueh is an MS student from the Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Taiwan. His research interests are bioinformatics, systems biology and drug discovery.

Chia-Hwa Lee is a professor at the School of Medical Laboratory Science and Biotechnology, Taipei Medical University, Taipei, Taiwan. He specializes in molecular cell biology and anti-cancer drug development, with a specific focus on identifying oncotargets through CRISPR/Cas9 gene editing on various cancer types.

Ting-Yi Hao is an MS student from the Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Taiwan. His research interests are bioinformatics, network inference, single-cell analysis and statistical model development.

Tzu-Ying Tu is an MS student from the Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Taiwan. Her research interests are bioinformatics, network biology and systems biology.

Lan-Yun Chang is a PhD student from the Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Taiwan. Her research interests are bioinformatics, systems biology and drug discovery.

Jih-Chin Lee is a professor and director of the Department of Otolaryngology-Head and Neck Surgery, Tri-Service General Hospital, National Defense Medical Center, Taiwan. His research interests include head and neck cancer research and thyroid diseases.

Chun-Yu Lin is an assistant professor at the Institute of Bioinformatics and Systems Biology and the Department of Biological Science and Technology, National Yang Ming Chiao Tung University, Taiwan. He is also affiliated with the Center for Intelligent Drug Systems and Smart Bio-devices, National Yang Ming Chiao Tung University, Taiwan, and the School of Dentistry, Kaohsiung Medical University, Taiwan. His research interests include bioinformatics, network biology, network medicine and single-cell analysis.

References

1.

Schadt
EE
.
Molecular networks as sensors and drivers of common human diseases
.
Nature
2009
;
461
:
218
23
.

2.

Barabasi
AL
,
Gulbahce
N
,
Loscalzo
J
.
Network medicine: a network-based approach to human disease
.
Nat Rev Genet
2011
;
12
:
56
68
.

3.

Kuijjer
ML
,
Tung
MG
,
Yuan
G
, et al.
Estimating sample-specific regulatory networks
.
iScience
2019
;
14
:
226
40
.

4.

Orchard
S
,
Ammari
M
,
Aranda
B
, et al.
The MIntAct project--IntAct as a common curation platform for 11 molecular interaction databases
.
Nucleic Acids Res
2014
;
42
:
D358
63
.

5.

Oughtred
R
,
Rust
J
,
Chang
C
, et al.
The BioGRID database: a comprehensive biomedical resource of curated protein, genetic, and chemical interactions
.
Protein Sci
2021
;
30
:
187
200
.

6.

De Smet
R
,
Marchal
K
.
Advantages and limitations of current network inference methods
.
Nat Rev Microbiol
2010
;
8
:
717
29
.

7.

van
Dam
S
,
Vosa
U
,
van der
Graaf
A
, et al.
Gene co-expression analysis for functional classification and gene-disease predictions
.
Brief Bioinform
2018
;
19
:
575
92
.

8.

Langfelder
P
,
Horvath
S
.
WGCNA: an R package for weighted correlation network analysis
.
BMC Bioinformatics
2008
;
9
:
559
.

9.

Marbach
D
,
Costello
JC
,
Kuffner
R
, et al.
Wisdom of crowds for robust gene network inference
.
Nat Methods
2012
;
9
:
796
804
.

10.

Liu
X
,
Wang
Y
,
Ji
H
, et al.
Personalized characterization of diseases using sample-specific networks
.
Nucleic Acids Res
2016
;
44
:
e164
.

11.

Guo
WF
,
Zhang
SW
,
Zeng
T
, et al.
A novel network control model for identifying personalized driver genes in cancer
.
PLoS Comput Biol
2019
;
15
:
e1007520
.

12.

Dai
H
,
Li
L
,
Zeng
T
, et al.
Cell-specific network constructed by single-cell RNA sequencing data
.
Nucleic Acids Res
2019
;
47
:
e62
.

13.

Lotvall
J
,
Akdis
CA
,
Bacharier
LB
, et al.
Asthma endotypes: a new approach to classification of disease entities within the asthma syndrome
.
J Allergy Clin Immunol
2011
;
127
:
355
60
.

14.

Swanton
C
.
Intratumor heterogeneity: evolution through space and time
.
Cancer Res
2012
;
72
:
4875
82
.

15.

Cancer Genome Atlas Research N
,
Weinstein
JN
,
Collisson
EA
, et al.
The cancer genome atlas pan-cancer analysis project
.
Nat Genet
2013
;
45
:
1113
20
.

16.

Ghandi
M
,
Huang
FW
,
Jane-Valbuena
J
, et al.
Next-generation characterization of the cancer cell line Encyclopedia
.
Nature
2019
;
569
:
503
8
.

17.

Zhu
Y
,
Qiu
P
,
Ji
Y
.
TCGA-assembler: open-source software for retrieving and processing TCGA data
.
Nat Methods
2014
;
11
:
599
600
.

18.

Sondka
Z
,
Bamford
S
,
Cole
CG
, et al.
The COSMIC cancer gene census: describing genetic dysfunction across all human cancers
.
Nat Rev Cancer
2018
;
18
:
696
705
.

19.

Repana
D
,
Nulsen
J
,
Dressler
L
, et al.
The network of cancer genes (NCG): a comprehensive catalogue of known and candidate cancer genes from cancer sequencing screens
.
Genome Biol
2019
;
20
:
1
.

20.

Dhammi
IK
,
Kumar
S
.
Medical subject headings (MeSH) terms
.
Indian J Orthop
2014
;
48
:
443
4
.

21.

Hamosh
A
,
Scott
AF
,
Amberger
JS
, et al.
Online Mendelian inheritance in man (OMIM), a knowledgebase of human genes and genetic disorders
.
Nucleic Acids Res
2005
;
33
:
D514
7
.

22.

Davis
AP
,
Grondin
CJ
,
Johnson
RJ
, et al.
Comparative toxicogenomics database (CTD): update 2021
.
Nucleic Acids Res
2021
;
49
:
D1138
43
.

23.

Landrum
MJ
,
Lee
JM
,
Riley
GR
, et al.
ClinVar: public archive of relationships among sequence variation and human phenotype
.
Nucleic Acids Res
2014
;
42
:
D980
5
.

24.

Ramos
EM
,
Hoffman
D
,
Junkins
HA
, et al.
Phenotype-genotype integrator (PheGenI): synthesizing genome-wide association study (GWAS) data with existing genomic resources
.
Eur J Hum Genet
2014
;
22
:
144
7
.

25.

Pinero
J
,
Bravo
A
,
Queralt-Rosinach
N
, et al.
DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants
.
Nucleic Acids Res
2017
;
45
:
D833
9
.

26.

Lin
CY
,
Lee
CH
,
Chuang
YH
, et al.
Membrane protein-regulated networks across human cancers
.
Nat Commun
2019
;
10
:
3131
.

27.

Vogelstein
B
,
Papadopoulos
N
,
Velculescu
VE
, et al.
Cancer genome landscapes
.
Science
2013
;
339
:
1546
58
.

28.

Gao
J
,
Aksoy
BA
,
Dogrusoz
U
, et al.
Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal
.
Sci Signal
2013
;
6
:
pl1
.

29.

Wishart
DS
,
Knox
C
,
Guo
AC
, et al.
DrugBank: a knowledgebase for drugs, drug actions and drug targets
.
Nucleic Acids Res
2008
;
36
:
D901
6
.

30.

Liu
T
,
Lin
Y
,
Wen
X
, et al.
BindingDB: a web-accessible database of experimentally determined protein-ligand binding affinities
.
Nucleic Acids Res
2007
;
35
:
D198
201
.

31.

Gaulton
A
,
Bellis
LJ
,
Bento
AP
, et al.
ChEMBL: a large-scale bioactivity database for drug discovery
.
Nucleic Acids Res
2012
;
40
:
D1100
7
.

32.

Cheng
F
,
Kovacs
IA
,
Barabasi
AL
.
Network-based prediction of drug combinations
.
Nat Commun
2019
;
10
:
1197
.

33.

Kanehisa
M
,
Goto
S
.
KEGG: Kyoto encyclopedia of genes and genomes
.
Nucleic Acids Res
2000
;
28
:
27
30
.

34.

Wei
WQ
,
Cronin
RM
,
Xu
H
, et al.
Development and evaluation of an ensemble resource linking medications to their indications
.
J Am Med Inform Assoc
2013
;
20
:
954
61
.

35.

Yang
W
,
Soares
J
,
Greninger
P
, et al.
Genomics of drug sensitivity in cancer (GDSC): a resource for therapeutic biomarker discovery in cancer cells
.
Nucleic Acids Res
2013
;
41
:
D955
61
.

36.

Meyers
RM
,
Bryan
JG
,
McFarland
JM
, et al.
Computational correction of copy number effect improves specificity of CRISPR-Cas9 essentiality screens in cancer cells
.
Nat Genet
2017
;
49
:
1779
84
.

37.

Barabasi
AL
,
Oltvai
ZN
.
Network biology: understanding the cell's functional organization
.
Nat Rev Genet
2004
;
5
:
101
13
.

38.

Han
JD
,
Dupuy
D
,
Bertin
N
, et al.
Effect of sampling on topology predictions of protein-protein interaction networks
.
Nat Biotechnol
2005
;
23
:
839
44
.

39.

Seyed-Allaei
H
,
Bianconi
G
,
Marsili
M
.
Scale-free networks with an exponent less than two
.
Phys Rev E Stat Nonlin Soft Matter Phys
2006
;
73
:
046113
.

40.

Lin
CY
,
Lee
TL
,
Chiu
YY
, et al.
Module organization and variance in protein-protein interaction networks
.
Sci Rep
2015
;
5
:
9386
.

41.

Morselli, Gysi
D
,
do
Valle
I
,
Zitnik
M
, et al.
Network medicine framework for identifying drug-repurposing opportunities for COVID-19
.
Proc Natl Acad Sci U S A
2021
;
118
:e2025581118. https://doi.org/10.1073/pnas.2025581118.

42.

Szklarczyk
D
,
Gable
AL
,
Lyon
D
, et al.
STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets
.
Nucleic Acids Res
2019
;
47
:
D607
13
.

43.

Clark
NR
,
Hu
KS
,
Feldmann
AS
, et al.
The characteristic direction: a geometrical approach to identify differentially expressed genes
.
BMC Bioinformatics
2014
;
15
:
79
.

44.

Jia
P
,
Wang
Q
,
Chen
Q
, et al.
MSEA: detection and quantification of mutation hotspots through mutation set enrichment analysis
.
Genome Biol
2014
;
15
:
489
.

45.

Leiserson
MD
,
Wu
HT
,
Vandin
F
, et al.
CoMEt: a statistical approach to identify combinations of mutually exclusive alterations in cancer
.
Genome Biol
2015
;
16
:
160
.

46.

Ryslik
GA
,
Cheng
Y
,
Cheung
KH
, et al.
Utilizing protein structure to identify non-random somatic mutations
.
BMC Bioinformatics
2013
;
14
:
190
.

47.

Ciriello
G
,
Cerami
E
,
Sander
C
, et al.
Mutual exclusivity analysis identifies oncogenic network modules
.
Genome Res
2012
;
22
:
398
406
.

48.

Lawrence
MS
,
Stojanov
P
,
Polak
P
, et al.
Mutational heterogeneity in cancer and the search for new cancer-associated genes
.
Nature
2013
;
499
:
214
8
.

49.

Tamborero
D
,
Gonzalez-Perez
A
,
Lopez-Bigas
N
.
OncodriveCLUST: exploiting the positional clustering of somatic mutations to identify cancer genes
.
Bioinformatics
2013
;
29
:
2238
44
.

50.

Cerami
E
,
Demir
E
,
Schultz
N
, et al.
Automated network analysis identifies core pathways in glioblastoma
.
PLoS One
2010
;
5
:
e8918
.

51.

Han
Y
,
Yang
J
,
Qian
X
, et al.
DriverML: a machine learning algorithm for identifying driver genes in cancer sequencing studies
.
Nucleic Acids Res
2019
;
47
:
e45
.

52.

Vandin
F
,
Upfal
E
,
Raphael
BJ
.
De novo discovery of mutated driver pathways in cancer
.
Genome Res
2012
;
22
:
375
85
.

53.

Fortin
JP
,
Hansen
KD
.
Reconstructing a/B compartments as revealed by hi-C using long-range correlations in epigenetic data
.
Genome Biol
2015
;
16
:
180
.

54.

Hou
JP
,
Ma
J
.
DawnRank: discovering personalized driver genes in cancer
.
Genome Med
2014
;
6
:
56
.

55.

Bashashati
A
,
Haffari
G
,
Ding
J
, et al.
DriverNet: uncovering the impact of somatic driver mutations on transcriptional networks in cancer
.
Genome Biol
2012
;
13
:
R124
.

56.

Porta-Pardo
E
,
Godzik
A
.
E-driver: a novel method to identify protein regions driving cancer
.
Bioinformatics
2014
;
30
:
3109
14
.

57.

Stratton
MR
,
Campbell
PJ
,
Futreal
PA
.
The cancer genome
.
Nature
2009
;
458
:
719
24
.

58.

Zhang
Y
,
Chang
X
,
Xia
J
, et al.
Identifying network biomarkers of cancer by sample-specific differential network
.
BMC Bioinformatics
2022
;
23
:
230
.

59.

Zotenko
E
,
Mestre
J
,
O'Leary
DP
, et al.
Why do hubs in the yeast protein interaction network tend to be essential: reexamining the connection between the network topology and essentiality
.
PLoS Comput Biol
2008
;
4
:
e1000140
.

60.

Goh
KI
,
Cusick
ME
,
Valle
D
, et al.
The human disease network
.
Proc Natl Acad Sci U S A
2007
;
104
:
8685
90
.

61.

Hahn
MW
,
Kern
AD
.
Comparative genomics of centrality and essentiality in three eukaryotic protein-interaction networks
.
Mol Biol Evol
2005
;
22
:
803
6
.

62.

Patel
K
,
Doudican
NA
,
Schiff
PB
, et al.
Albendazole sensitizes cancer cells to ionizing radiation
.
Radiat Oncol
2011
;
6
:
160
.

63.

Ghasemi
F
,
Black
M
,
Vizeacoumar
F
, et al.
Repurposing Albendazole: new potential as a chemotherapeutic agent with preferential activity against HPV-negative head and neck squamous cell cancer
.
Oncotarget
2017
;
8
:
71512
9
.

64.

Koelblinger
P
,
Thuerigen
O
,
Dummer
R
.
Development of encorafenib for BRAF-mutated advanced melanoma
.
Curr Opin Oncol
2018
;
30
:
125
33
.

65.

Lin
CY
,
Ruan
P
,
Li
R
, et al.
Deep learning with evolutionary and genomic profiles for identifying cancer subtypes
.
J Bioinform Comput Biol
2019
;
17
:
1940005
.

66.

Wold
S
,
Esbensen
K
,
Geladi
P
.
Principal component analysis
.
Chemom Intell Lab
1987
;
2
:
37
52
.

67.

van der
Maaten
L
,
Hinton
G
.
Visualizing data using t-SNE
.
J Mach Learn Res
2008
;
9
:
2579
605
.

68.

Cancer Genome Atlas N
.
Comprehensive molecular characterization of human colon and rectal cancer
.
Nature
2012
;
487
:
330
7
.

69.

Jiang
J
.
The `dark matter' in the plant genomes: non-coding and unannotated DNA sequences associated with open chromatin
.
Curr Opin Plant Biol
2015
;
24
:
17
23
.

70.

Cleves
PA
,
Shumaker
A
,
Lee
J
, et al.
Unknown to known: advancing knowledge of coral gene function
.
Trends Genet
2020
;
36
:
93
104
.

71.

Ebbert
MTW
,
Jensen
TD
,
Jansen-West
K
, et al.
Systematic analysis of dark and camouflaged genes reveals disease-relevant genes hiding in plain sight
.
Genome Biol
2019
;
20
:
97
.

72.

Chen
M
,
Sheng
XJ
,
Qin
YY
, et al.
TBC1D8 amplification drives tumorigenesis through metabolism reprogramming in ovarian cancer
.
Theranostics
2019
;
9
:
676
90
.

73.

Baine
MJ
,
Chakraborty
S
,
Smith
LM
, et al.
Transcriptional profiling of peripheral blood mononuclear cells in pancreatic cancer patients identifies novel genes with potential diagnostic utility
.
PLoS One
2011
;
6
:
e17014
.

74.

Guo
WF
,
Yu
X
,
Shi
QQ
, et al.
Performance assessment of sample-specific network control methods for bulk and single-cell biological data analysis
.
PLoS Comput Biol
2021
;
17
:
e1008962
.

75.

Liu
YY
,
Slotine
JJ
,
Barabasi
AL
.
Controllability of complex networks
.
Nature
2011
;
473
:
167
73
.

76.

Zanudo
JGT
,
Yang
G
,
Albert
R
.
Structure-based control of complex networks with nonlinear dynamics
.
Proc Natl Acad Sci U S A
2017
;
114
:
7234
9
.

77.

Nacher
JC
,
Akutsu
T
.
Dominating scale-free networks with variable scaling exponent: heterogeneous networks are not difficult to control
.
New J Phys
2012
;
14
:
073005
.

78.

Li
R
,
Lin
CY
,
Guo
WF
, et al.
Weighted minimum feedback vertex sets and implementation in human cancer genes detection
.
BMC Bioinformatics
2021
;
22
:
143
.

79.

Mani
R
,
St Onge
RP
,
Hartman
JLT
, et al.
Defining genetic interaction
.
Proc Natl Acad Sci U S A
2008
;
105
:
3461
6
.

80.

Pires
DE
,
Ascher
DB
,
Blundell
TL
.
mCSM: predicting the effects of mutations in proteins using graph-based signatures
.
Bioinformatics
2014
;
30
:
335
42
.

81.

Duan
G
,
Walther
D
.
The roles of post-translational modifications in the context of protein interaction networks
.
PLoS Comput Biol
2015
;
11
:
e1004049
.

82.

Huang
Y
,
Chang
X
,
Zhang
Y
, et al.
Disease characterization using a partial correlation-based sample-specific network
.
Brief Bioinform
2021
;
22
:bbaa062. https://doi.org/10.1093/bib/bbaa062.

83.

Navin
N
,
Kendall
J
,
Troge
J
, et al.
Tumour evolution inferred by single-cell sequencing
.
Nature
2011
;
472
:
90
4
.

84.

Shapiro
E
,
Biezuner
T
,
Linnarsson
S
.
Single-cell sequencing-based technologies will revolutionize whole-organism science
.
Nat Rev Genet
2013
;
14
:
618
30
.

85.

Chu
LF
,
Leng
N
,
Zhang
J
, et al.
Single-cell RNA-seq reveals novel regulators of human embryonic stem cell differentiation to definitive endoderm
.
Genome Biol
2016
;
17
:
173
.

86.

Yan
L
,
Yang
M
,
Guo
H
, et al.
Single-cell RNA-Seq profiling of human preimplantation embryos and embryonic stem cells
.
Nat Struct Mol Biol
2013
;
20
:
1131
9
.

Author notes

Hsin-Hua Chen, Chun-Wei Hsueh, Chia-Hwa Lee, and Ting-Yi Hao contributed equally to this work.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]