USNAP: fast unique dense region detection and its application to lung cancer

Abstract Motivation Many real-world problems can be modeled as annotated graphs. Scalable graph algorithms that extract actionable information from such data are in demand since these graphs are large, varying in topology, and have diverse node/edge annotations. When these graphs change over time they create dynamic graphs, and open the possibility to find patterns across different time points. In this article, we introduce a scalable algorithm that finds unique dense regions across time points in dynamic graphs. Such algorithms have applications in many different areas, including the biological, financial, and social domains. Results There are three important contributions to this manuscript. First, we designed a scalable algorithm, USNAP, to effectively identify dense subgraphs that are unique to a time stamp given a dynamic graph. Importantly, USNAP provides a lower bound of the density measure in each step of the greedy algorithm. Second, insights and understanding obtained from validating USNAP on real data show its effectiveness. While USNAP is domain independent, we applied it to four non-small cell lung cancer gene expression datasets. Stages in non-small cell lung cancer were modeled as dynamic graphs, and input to USNAP. Pathway enrichment analyses and comprehensive interpretations from literature show that USNAP identified biologically relevant mechanisms for different stages of cancer progression. Third, USNAP is scalable, and has a time complexity of O(m+mc log nc+nc log nc), where m is the number of edges, and n is the number of vertices in the dynamic graph; mc is the number of edges, and nc is the number of vertices in the collapsed graph. Availability and implementation The code of USNAP is available at https://www.cs.utoronto.ca/∼juris/data/USNAP22.

Algorithm 1 gives a top level description of USNAP.USNAP first creates a collapsed graph, G c .USNAP then starts with the entire collapsed graph, and finds one usnap in each iteration.USNAP then removes the discovered usnap from G c , and continues to search for another usnap.
Note that a subgraph that satisfies the exclusive threshold may not exist for any given graph.Furthermore, it is possible that USNAP is not able to find any usnap given its greedy nature.Thus, a restart option is provided in lines 5-6.Restarts allow for possible different orderings in the removal of vertices which may lead to finding usnaps.The removal of a vertex will affect its neighbors, which in turn may affect which vertex is to be removed first.
To increase applicability of USNAP, we provide a few tunable parameters.Some applications may desire the overlapping of vertices between usnaps while others may desire not to have any overlaps.Line 9 provides an option for the overlapping of vertices between usnaps.p specifies the fraction of vertices in the identified usnap that should remain for the searching of other usnaps.Algorithm 2 creates a collapsed graph, G c , from G. Hash edgeH contains an edge as key, and the count of the number of snapshots having a given edge as value.edgeU niqueH keeps track of the edges in G u .The weight in line 12 is computed using the weight function specified in the manuscript.
) for each vertex in G c .Only the removed vertices, maximum density, and indices for configurations are stored.Note again that not all graphs will have a usnap.Moreover, it is possible that no subgraph that satisfies the exclusive threshold can be found due to the greedy nature of USNAP.
Algorithm 4 removes a vertex according to Lemma 1, and performs the necessary updates.Line 1 performs the delete minimum operation in M H.It is important that the computation of the value to be updated for u be in constant time in order for line 4 to execute in O(logn c ).

Lemmas
We consider the following 2 cases: du(v)min d(v)min = 0 and du(v)min d(v)min > 0. Note that du(v) d(v) ∈ [0, 1], and d(v) cannot be 0 as there is no vertex with no edge in G c .The first case corresponds to when vertex v has the minimum weighted degree in G c and it doesn't have any u specific edge.The second case corresponds to when v has the minimum weighted degree in G c and it does have u specific edge(s).
If du(v)min d(v)min = 0, then USNAP will pick v to be removed as: In this case, the removal of v results in which is the highest possible value for density in G ′ c .This is because no matter what mass you subtract from the previous mass, the denominator will be the same, |V |−1.This corresponds to the case when there is no u specific edge that gets removed when v having d w (v) min is removed.This would be the best case.This proves the left side of the lemma, mass(Gc)−dw(v)min and we know that d w (v) min < d w (b) by definition.Thus, we have To get an upper bound for d w (b), α would be 1 and β would be 0. Therefore, In this case, the removal of b will result in .
Lemma 1 proves that in each step of the greedy algorithm, density(G ′ c ) would not be worse than mass(Gc)−2dw(v)min . The function that is used to select which v to remove in V (G c ) in lemma 1 has two aspects: d w (v) is to have dense subgraphs, and (1 + du(v)  d(v) ) is to discourage the removal of nodes that have u specific edge(s).The more u specific edges v has, the greater (1 + du(v)  d(v) ) will be. 1 is added to du(v)  d(v) in order to prevent the return of a tie score, 0, if i, j ∈ V (G c ), d w (i) > d w (j), and both don't have any u specific edge.This is because if the function is (d w (v) du(v)  d(v) ), then when d u = 0, the score will be 0.This would be problematic if d w (i) >> d w (j) and USNAP removes i instead of j since they both have a score of 0. In FindARegion(G c , γ), line 2 takes O(n c ) to initialize M H. Lines 5, 6, 9 take constant time as variables are updated, and are not computed from scratch.The bottleneck for FindARegion(G c , γ) is in line 4, RemoveNode(G c , M H).
Line 1 of RemoveNode(G c , M H) takes O(logn c ) as it performs the delete minimum operation in M H. Lines 1, 2 of RemoveNode(G c , M H) execute n c times as each vertex is to be removed once.Thus, lines 1 − 2 take O(n c logn c ). Lines 3 − 5 in RemoveNode(G c , M H) execute m c times as each edge is to be updated once, and line 4 takes O(logn c ) as it performs the decrease key operation in M H. Thus, lines 3 − 5 take O(m c logn c ).Therefore, the time complexity for FindARegion Since m c ≤ m and n c ≤ n, the time complexity of USNAP in terms of the number of edges and number of vertices in the input set of T snapshot graphs is O(mlogn + nlogn).Expressing the time complexity with m c , n c provides a tighter bound for USNAP.
Theorem 1.The removal of a vertex in each iteration according to Lemma 1 gives a 4 approximation for density(G c ).
Proof.We obtain Theorem 1 by putting Lemma 3 and 4 together.This proof is adopted from Charikar's proof for his greedy approximation algorithm (Charikar [2000]).We first give an upper bound for the optimal solution.We assign each weighted edge i j ∈ E(G c ) to either vertex i or j.For each vertex i ∈ V (G c ), d a (i) denotes the sum of the edge weights for those edges that are assigned to i.
Now, let's consider the edge assignment for the greedy algorithm.All edges are unassigned initially.When the vertex with the minimum value according to Lemma 1 is deleted from S c , the vertex is assigned all edges that connect from this vertex to the rest of vertices in S c .All edges between any 2 vertices in the current set S c are unassigned, and the rest of the edges are assigned.All edges are assigned at the end of the execution of the greedy algorithm.Proof.Let's consider a single iteration.Let i min be the vertex that is to be deleted, and i min is selected according to lemma 1, i.e. i min has the minimum value on |Sc| is the average weighted degree, and thus , over all iterations.Therefore d max a ≤ 4α.
Putting together Lemma 3 and Lemma 4, we obtained Theorem 1.We designed the removal of vertex as in Lemma 1 in order to find not only dense subgraphs, but unique dense subgraphs.The bound in Theorem 1 has not accounted for satisfying the exclusive threshold.However, from the validation of USNAP on real world datasets, we have shown that USNAP works well in practice.

Input
2.1 Datasets and data sources 4 NSCLC microarray gene expression datasets (Chitale et al. [2009], Okayama et al. [2012], Raponi et al. [2006]) obtained from the Gene Expression Omnibus database (Edgar et al. [2002]), and the Memorial Sloan-Kettering Cancer Center were used to demonstrate the effectiveness of USNAP.The number of stages, the number of samples, and the platform used were criteria for dataset selection.In this paper, we refer to the 4 datasets as ChitaleMA1, ChitaleMA2 (both were from http://cbio.mskcc.org/public/lung_array_data/), Okayama (GSE31210) and Raponi (GSE4573).Protein-protein interaction data used was from Integrated Interactions Database (IID) version 2018-11 (Kotlyar et al. [2019]).Refer to Table 1 for more information regarding the data used.

Construction of the dynamic graphs
We generated the dynamic graphs for each dataset using the following approach, for each stage in each dataset: • pairwise Pearson correlations were calculated for all gene pairs • edges were ranked based on their absolute correlation values • gene pairs with the top 1% of the absolute values were selected • the selected correlated gene pairs were only considered if they had an overlap with protein-protein interactions.

Results
Tables 2 − 5 in the supplementary material show the top 10 (or less if there were fewer results) densest usnaps having more than 2 nodes that are specific to stage 1A, stage 1B, stage 2 or 2B and stage 3A for each dataset respectively.The tables are in sorted order of density with d0 being the densest usnap for each stage and for each dataset.There could be ties for the 10th place but only one of them is included.does not contain a particular stage.For example, the dynamic graph that represents ChitaleMA2 does not have stage 2 or 2B (refer to Table 1), and thus, NAs are placed in the #Edge in 2/2B columns for ChitaleMA2.
Importantly, for all usnaps from all 4 datasets, the edges that are specific to a stage do not appear in any other stages in the same dataset.For example, usnap d0 in Raponi has a clique with 15 edges that are specific to stage 2B (depicted in closed circles in Fig. 1).Out of these 15 edges, no edge is present in stage 1A, 1B or 3A of Raponi.
Moreover, USNAP returns usnaps that are specific induced subgraphs for a given stage in the dataset; that is, all edges among the subgraph's nodes in the specific stage will be returned.For example, usnap d1 that is specific to stage 1A in chitaleMA1 returns 18 edges among 7 nodes.This means that in stage 1A of chitaleMA1, there are only these 18 edges among these 7 nodes, and stage 1B, 2B and 3A in chitaleMA1 do not have any of these 18 edges among these 7 nodes.

V
(G c ) ← V (G c ) \ U where v ∈ U are arbitrary vertices in result and |U | = (1 − p) * #vertex in result 10 end 11 return ResultSet;

Lemma 2 .
USNAP has a time complexity of O(m + m c logn c + n c logn c ). Proof.USNAP creates a collapsed graph from the input set of T snapshot graphs which takes O(m).In CreateCollapseGraph(G, u), the loop in lines 2 − 7 executes m times.The loop in lines 9 − 13 executes m c times as |E(G u )| = |E(G c )|, and compute weights in line 12 takes constant time.Thus, USNAP 's line 2 takes O(m).The call to FindARegion(G c , γ) in line 4 takes O(m c logn c + n c logn c ), and FindARegion(G c , γ) is the bottleneck in the while loop for USNAP.

Lemma 4 .
Let α be the maximum value of density(S c ) over all sets of S c obtained from removing one vertex at a time in each iteration according to Lemma 1. Then d max a ≤ 4α.

Figure 1 :
Figure 1: USNAP detects unique dense subgraphs in the Raponi dataset.Usnap d0 subgraph specific to stage 1A (d0 − 1A) is shown as open circles, and usnap d0 subgraph specific to stage 2B (d0 − 2B) is shown as closed circles.d0 − 1A is only present in 1A, and not in any other stages; d0 − 2B is only present in 2B, and not in any other stages (see the corresponding d0 − 1A, d0 − 2B in other disease stages.)d0 − 1A is composed of genes implicated in asthma, and asthma has been implicated as a risk factor for lung cancer.

Table 1 :
Input dynamic graphs to USNAP All results are available at https://www.cs.utoronto.ca/ ~juris/data/USNAP22. #Edge in 2/2B is used as a column heading because ChitaleMA1 and Raponi have stage 2B data, and Okayama has stage 2 data.NAs do not mean that USNAP is unable to return results, but rather, NAs indicate that a given dataset

Table 2 :
USNAP 's results specific to stage 1A

Table 3 :
USNAP 's results specific to stage 1B

Table 4 :
USNAP 's results specific to stage 2 or 2B

Table 9 :
Pathway specific to stage 3A