A nonparametric significance test for sampled networks

Abstract

Motivation: Our work is motivated by an interest in constructing a protein–protein interaction network that captures key features associated with Parkinson’s disease. While there is an abundance of subnetwork construction methods available, it is often far from obvious which subnetwork is the most suitable starting point for further investigation.

Results: We provide a method to assess whether a subnetwork constructed from a seed list (a list of nodes known to be important in the area of interest) differs significantly from a randomly generated subnetwork. The proposed method uses a Monte Carlo approach. As different seed lists can give rise to the same subnetwork, we control for redundancy by constructing a minimal seed list as the starting point for the significance test. The null model is based on random seed lists of the same length as a minimum seed list that generates the subnetwork; in this random seed list the nodes have (approximately) the same degree distribution as the nodes in the minimum seed list. We use this null model to select subnetworks which deviate significantly from random on an appropriate set of statistics and might capture useful information for a real world protein–protein interaction network.

Availability and implementation: The software used in this paper is available for download at https://sites.google.com/site/elliottande/. The software is written in Python and uses the NetworkX library.
Supplementary information
 Supplementary data are available at Bioinformatics online.


No.   Gene Name        Mapped BioGRID IDs
1     LGALS3           110149
2     SOX4             112542
3     TWIST1           113142
4     PGK1             111251
5     NOT IDENTIFIED   NONE
6     SNRPB2           112513
7     CLK3             107609
8     CDKN1A           107460, 111099
9     NFE2L1           110851
10    ASNS             106932
11    ID3              109625
12    ID1              109623

Conversion of Table 1 from Ref. [2] into gene names, and the subsequent conversion into BioGRID identifiers using BioGRID's internal conversions. The ordering is the same in each table. Proteins marked with a * are present in BioGRID but are not part of the largest connected component of the network composed of interactions from Yeast 2 Hybrid.

Derivation Of Analytic Results
The analytic results shown here extend results in Ref. [4] and follow some of its notation. Note that Ref. [4] focuses on the classic problem of an unknown network with observed samples, whereas here we have the related problem in which we know the network but want to explore features of the samples.

We define the function $h(J, s)$ as the probability that a randomly selected seed list of size $s$ contains no node within $n$ hops of $J$. Let $B_n(J)$ be the set of nodes within $n$ hops of the nodes in $J$. If the seed list is selected uniformly at random from all subsets of nodes of size $s$, then we can use the hypergeometric distribution to derive an expression for $h(J, s)$. We randomly select $s$ seeds from the $|V|$ nodes; if any seed lies in $B_n(J)$, then at least one node in $J$ will be included in the sample. Thus we require all of the seeds to be chosen from $V \setminus B_n(J)$, and the hypergeometric distribution gives the following form for $h(J, s)$:

$$h(J, s) = \binom{|V| - |B_n(J)|}{s} \Big/ \binom{|V|}{s}.$$

If we wish to fix the degree sequence of the seeds (or, with a small adjustment, the binned degree), we can use a similar approach to the uniform case to derive a degree sequence version, $h_d(J, t)$, where $t$ is the degree sequence of the seed list. This results in the following form for $h_d(J, t)$:

$$h_d(J, t) = \prod_{u \in U(t)} \binom{D(V \setminus B_n(J), u)}{F(t, u)} \Big/ \binom{D(V, u)}{F(t, u)}.$$

Here $F(t, u)$ is a counting function that counts how many instances of $u$ there are in $t$ (e.g. $F([1, 2, 3, 4, 1], 1) = 2$), $U(t)$ returns the unique elements of $t$, and $D(J, r)$ is the number of elements in $J$ of degree $r$. The seeds of different degrees are selected independently, so we can calculate the probability for each unique degree in the seed degree list and then multiply them to get the final probability.
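As an illustration of the two probabilities described above, the following is a minimal Python sketch that evaluates them with binomial coefficients and checks the uniform case against brute-force enumeration. The function names (`ball`, `h_uniform`, `h_degree`), the adjacency-dictionary representation and the toy path graph are assumptions made for this example and are not taken from the authors' software.

```python
from itertools import combinations
from math import comb


def ball(adj, nodes, n):
    """Return B_n(nodes): all nodes within n hops of any node in `nodes`."""
    seen = set(nodes)
    frontier = set(nodes)
    for _ in range(n):
        frontier = {v for u in frontier for v in adj[u]} - seen
        seen |= frontier
    return seen


def h_uniform(adj, J, s, n):
    """P(no seed lies in B_n(J)) for s seeds drawn uniformly without replacement."""
    outside = len(adj) - len(ball(adj, J, n))
    return comb(outside, s) / comb(len(adj), s)


def h_degree(adj, J, t, n):
    """Degree-sequence version: seeds of each degree are drawn independently.

    Assumes every degree in t occurs often enough in the network.
    """
    t = list(t)
    blocked = ball(adj, J, n)
    prob = 1.0
    for u in set(t):                                  # U(t): unique degrees in t
        k = t.count(u)                                # F(t, u): occurrences of u in t
        same_degree = [v for v in adj if len(adj[v]) == u]
        total = len(same_degree)                      # D(V, u)
        free = sum(1 for v in same_degree if v not in blocked)
        prob *= comb(free, k) / comb(total, k)
    return prob


# Toy path graph 0-1-2-3-4-5 (hypothetical data, for illustration only).
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}

# Brute-force check of h_uniform: enumerate every seed list of size s.
J, s, n = {0}, 2, 1
blocked = ball(adj, J, n)
all_lists = list(combinations(adj, s))
brute = sum(1 for S in all_lists if not blocked & set(S)) / len(all_lists)
```

On this toy graph `h_uniform(adj, J, s, n)` and `brute` agree, which is a useful sanity check before applying the formula to a large network where enumeration is infeasible.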
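Deriving the Mean and Variance
Following the notation in the paper, we let $X$ be a random variable denoting the number of nodes in a snowball sampled graph with seed list $S$, where $S$ is a uniform random draw over all possible seed lists. For notational convenience we define $|S|$ as the number of seeds in $S$. We are interested in the case where $|S| = s$, so we restrict our calculations to this case. Note that by further restricting $S$ we obtain other schemes, for example enforcing the degree sequence. In the case of enforcing the degree sequence we can repeat many of the following arguments by replacing $h$ with $h_d$.

If we let $Y_i$ be an indicator variable for the presence of node $i$ in the sample, then the number of nodes in a sample is $X = \sum_i Y_i$. We can compute the mean number of nodes as follows:

$$E[X \mid |S| = s] = \sum_i E[Y_i \mid |S| = s] = \sum_i \left(1 - h(\{i\}, s)\right).$$

To compute the variance we can use the formula for the variance of a sum of correlated variables to get:

$$\mathrm{Var}(X \mid |S| = s) = \sum_i \mathrm{Var}(Y_i \mid |S| = s) + \sum_{i \neq j} \mathrm{Cov}(Y_i, Y_j \mid |S| = s).$$

We can regard each term in the summation, $(Y_i \mid |S| = s)$, as a Bernoulli random variable with

$$P(Y_i = 1 \mid |S| = s) = 1 - h(\{i\}, s), \qquad \mathrm{Var}(Y_i \mid |S| = s) = h(\{i\}, s)\left(1 - h(\{i\}, s)\right).$$

The covariance between the two variables is defined as:

$$\mathrm{Cov}(Y_i, Y_j) = E[Y_i Y_j] - E[Y_i]\,E[Y_j],$$

thus, since $E[Y_i Y_j \mid |S| = s] = 1 - h(\{i\}, s) - h(\{j\}, s) + h(\{i, j\}, s)$:

$$\mathrm{Cov}(Y_i, Y_j \mid |S| = s) = h(\{i, j\}, s) - h(\{i\}, s)\,h(\{j\}, s).$$

Combining and simplifying all of the terms we obtain:

$$\mathrm{Var}(X \mid |S| = s) = \sum_i h(\{i\}, s)\left(1 - h(\{i\}, s)\right) + \sum_{i \neq j} \left(h(\{i, j\}, s) - h(\{i\}, s)\,h(\{j\}, s)\right).$$

We can then factor the expression, using $h(\{i, i\}, s) = h(\{i\}, s)$, to obtain the form in the paper:

$$\mathrm{Var}(X \mid |S| = s) = \sum_i \sum_j h(\{i, j\}, s) - \Big(\sum_L h(\{L\}, s)\Big)^2,$$

where $L$ is a dummy summation variable.

Figure 1 illustrates the effect of seed list size on the distribution of the number of nodes in a 1-hop snowball sample in the BioGRID PPI network [1,7]. As we do not have a seed list of interest in this case, we have assumed that there is no restriction on the degree sequence of the seed nodes.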
As expected, the larger the number of seed nodes, the larger the average number of nodes in the resulting network. Further, a small change in the number of seed nodes can have a large impact on the expected size of the network. The number of nodes in a 1-hop snowball sample on, say, 20 proteins from the PPI network may appear small when compared to subnetworks randomly generated from 30 seeds but large when compared to subnetworks generated from 10 seeds.
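The mean and variance of the sample size can be checked numerically on a small network. The sketch below is an illustration only: it assumes uniform seed selection, uses the identities $E[Y_i] = 1 - h(\{i\}, s)$ and $\mathrm{Cov}(Y_i, Y_j) = h(\{i, j\}, s) - h(\{i\}, s)h(\{j\}, s)$ that follow from the indicator-variable argument, and compares the result with exhaustive enumeration of all seed lists. The helper names and the toy graph are hypothetical.

```python
from itertools import combinations
from math import comb


def ball(adj, nodes, n):
    """Return all nodes within n hops of any node in `nodes`."""
    seen = set(nodes)
    frontier = set(nodes)
    for _ in range(n):
        frontier = {v for u in frontier for v in adj[u]} - seen
        seen |= frontier
    return seen


def h(adj, J, s, n):
    """P(no seed within n hops of J) under uniform selection of s seeds."""
    outside = len(adj) - len(ball(adj, J, n))
    return comb(outside, s) / comb(len(adj), s)


def snowball_size_moments(adj, s, n):
    """Analytic mean and variance of the number of nodes in an n-hop sample."""
    nodes = list(adj)
    h1 = {i: h(adj, {i}, s, n) for i in nodes}
    mean = sum(1 - h1[i] for i in nodes)
    # Sum over all ordered pairs (i, j); the set {i, i} collapses to {i}.
    pair_sum = sum(h(adj, {i, j}, s, n) for i in nodes for j in nodes)
    var = pair_sum - sum(h1.values()) ** 2
    return mean, var


# Toy path graph 0-1-2-3-4-5 (hypothetical data, for illustration only).
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
s, n = 2, 1

# Exhaustive enumeration of every seed list of size s.
sizes = [len(ball(adj, S, n)) for S in combinations(adj, s)]
brute_mean = sum(sizes) / len(sizes)
brute_var = sum((x - brute_mean) ** 2 for x in sizes) / len(sizes)

mean, var = snowball_size_moments(adj, s, n)
```

The pairwise sum costs on the order of $|V|^2$ ball computations, so for a network of BioGRID's size one would likely cache the balls of single nodes rather than recompute them as this sketch does.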

Algorithms To Add Or Remove Redundant Seeds
To discover redundant seed nodes we need to be able to guarantee that a pair of labelled networks are identical. The trivial way to do this is to compare the node lists and the edge lists; if they are equal then the networks are equal. As we will have to check equality a large number of times, this approach can prove computationally constraining. Making use of the fact that we are adding or removing nodes from the seed list, we can derive a simpler condition for this problem.
We take an arbitrary network with node set $V$ and edge list $E$, and a seed list $l_1$, and write $(V_{l_1}, E_{l_1}) = f_{arb}(V, E, l_1)$, where $f_{arb}$ is an arbitrary network sampling function, $l_1$ is a seed list and $V_{l_1}$ and $E_{l_1}$ are the subsets of nodes and edges that are in the subnetwork.
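Let us assume that we have two seed lists $l_1$ and $l_2$ with $l_1 \subseteq l_2$. Further, let us assume that our sampling technique guarantees $V_{l_1} \subseteq V_{l_2}$ and $E_{l_1} \subseteq E_{l_2}$. Trivially, for finite sets $T_1$ and $T_2$ with $T_1 \subseteq T_2$, $|T_1| = |T_2|$ implies $T_1 = T_2$. Therefore we can use the condition:

$$(V_{l_1}, E_{l_1}) = (V_{l_2}, E_{l_2}) \iff |V_{l_1}| = |V_{l_2}| \text{ and } |E_{l_1}| = |E_{l_2}|. \qquad (1)$$

Note that we must condition on both the set of edges and the set of nodes, as both are required to fully define the subnetwork. Therefore, if our sampling technique guarantees that $l_1 \subseteq l_2$ implies $V_{l_1} \subseteq V_{l_2}$ and $E_{l_1} \subseteq E_{l_2}$, then we can simply test for equality in the number of nodes and edges.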
Snowball Sampling
In snowball sampling the contribution from each of the nodes on the seed list is independent, as it is simply the set of nodes within a certain radius of each of the seeds. Therefore the expression for $V_{l_1}$ is as follows:

$$V_{l_1} = \bigcup_{x \in l_1} B_n(\{x\}),$$

where $B_n(\{x\})$ is the set of nodes within $n$ hops of $x$. If we take $l_1 \subseteq l_2$, then $V_{l_2} = V_{l_1} \cup V_{l_2 \setminus l_1}$ and therefore trivially $V_{l_1} \subseteq V_{l_2}$. We can use the same argument for $E_{l_1}$. We can therefore use the condition in (1) for snowball sampled networks, given that the seed list is a subset or a superset of the original seed list.
Deterministic Path Based Sampling Techniques
We define a deterministic path based sampling technique in undirected networks as a procedure for which the sample is a function of every pair of nodes on the seed list and the underlying network. Note that Path≤k and shortest path sampling both fall into this category. Therefore, for a seed list $l$,

$$V_l = \bigcup_{y, z \in l} P^{(V)}_{arb}(y, z), \qquad E_l = \bigcup_{y, z \in l} P^{(E)}_{arb}(y, z),$$

where $P^{(V)}_{arb}(y, z)$ and $P^{(E)}_{arb}(y, z)$ are the sets of the sampled nodes and edges respectively on the path(s) between $y$ and $z$ defined by the sampling technique in question. Let $l_1$ and $l_2$ be seed lists and let $l_1 \subseteq l_2$, then

$$V_{l_2} = V_{l_1} \cup V_{l_2 \setminus l_1} \cup \bigcup_{y \in l_1,\, z \in l_2 \setminus l_1} P^{(V)}_{arb}(y, z) \cup \bigcup_{y \in l_2 \setminus l_1,\, z \in l_1} P^{(V)}_{arb}(y, z),$$

where the right hand side is the union of the nodes included by the seed list $l_1$ (first term), the nodes that are included by the seed list $l_2 \setminus l_1$ (second term), and the third and fourth terms represent the contributions from paths which connect a node from $l_1$ to a node from $l_2 \setminus l_1$. Therefore $V_{l_1} \subseteq V_{l_2}$, and by a similar argument $E_{l_1} \subseteq E_{l_2}$. Thus we can use the condition in (1) for all deterministic path sampling techniques.
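The count-based condition in (1) can be illustrated for snowball sampling with a short Python sketch. It is a sketch under stated assumptions rather than the authors' implementation: the network is a plain adjacency dictionary, and the sampled edge set is assumed to consist of the edges traversed while expanding from the seeds, so that each seed contributes independently. The names `snowball` and `same_sample_by_counts` are hypothetical.

```python
def snowball(adj, seeds, n):
    """n-hop snowball sample: return (nodes, edges) reached from the seeds.

    Assumption: the sampled edges are those traversed during the expansion.
    """
    nodes, frontier, edges = set(seeds), set(seeds), set()
    for _ in range(n):
        reached = set()
        for u in frontier:
            for v in adj[u]:
                edges.add(frozenset((u, v)))   # undirected edge
                reached.add(v)
        frontier = reached - nodes
        nodes |= frontier
    return nodes, edges


def same_sample_by_counts(adj, l1, l2, n):
    """Compare two samples by size only; valid when one seed list contains the other."""
    small, large = sorted((set(l1), set(l2)), key=len)
    if not small <= large:
        raise ValueError("one seed list must be a subset of the other")
    v1, e1 = snowball(adj, small, n)
    v2, e2 = snowball(adj, large, n)
    return len(v1) == len(v2) and len(e1) == len(e2)


# Toy path graph 0-1-2-3-4-5 (hypothetical data, for illustration only).
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}

redundant = same_sample_by_counts(adj, [2], [1, 2], 2)      # node 1 adds nothing
not_redundant = same_sample_by_counts(adj, [2], [2, 5], 2)  # node 5 enlarges the sample
```

Comparing two integers per sample avoids materialising and comparing full node and edge lists, which is the saving the argument above is aiming for.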

Algorithm To Add Additional Redundant Seed Nodes
In the paper, given a seed list, a network, a sampling technique and the resultant subnetwork, we construct the largest seed list that will generate the same subnetwork. For a given seed list and network the procedure to do this is as follows:
1. Make an empty list, which will contain the possible seed nodes.
2. For each node in the subnetwork, check whether it can be added to the seed list without increasing the number of nodes or edges in the subnetwork. If so, add it to the list of possible seed nodes.
4. Form all seed lists with the original seeds and all but W of the nodes on the possible seed list.
5. Test if each of these seed lists produces the same subnetwork. If at least one seed list produces the subnetwork, return all tested seed lists that produce the same subnetwork; otherwise let W = W + 1 and go back to step 4.
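The steps above can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: it assumes snowball sampling with traversed edges as the sampling technique, represents the network as an adjacency dictionary, and assumes that W starts at 0 (the step that initialises W is not shown in the list above). The names `snowball` and `largest_seed_lists` are hypothetical.

```python
from itertools import combinations


def snowball(adj, seeds, n):
    """n-hop snowball sample: return (nodes, traversed edges)."""
    nodes, frontier, edges = set(seeds), set(seeds), set()
    for _ in range(n):
        reached = set()
        for u in frontier:
            for v in adj[u]:
                edges.add(frozenset((u, v)))
                reached.add(v)
        frontier = reached - nodes
        nodes |= frontier
    return nodes, edges


def largest_seed_lists(adj, seeds, n, sample=snowball):
    """Return the largest seed list(s) that generate the same subnetwork."""
    seeds = set(seeds)
    target_nodes, target_edges = sample(adj, seeds, n)
    size = (len(target_nodes), len(target_edges))

    def keeps_sample(seed_list):
        # Count-based equality test; valid because seed_list contains `seeds`.
        v, e = sample(adj, seed_list, n)
        return (len(v), len(e)) == size

    # Steps 1-2: nodes that can be added individually without growing the sample.
    candidates = [v for v in sorted(target_nodes - seeds)
                  if keeps_sample(seeds | {v})]

    # Steps 4-5, with W assumed to start at 0: drop W candidates at a time.
    for w in range(len(candidates) + 1):
        hits = []
        for dropped in combinations(candidates, w):
            trial = seeds | (set(candidates) - set(dropped))
            if keeps_sample(trial):
                hits.append(trial)
        if hits:
            return hits   # w == len(candidates) always succeeds with the original seeds


# Toy path graph 0-1-2-3-4-5 (hypothetical data, for illustration only).
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
result = largest_seed_lists(adj, {2}, 2)
```

The number of combinations tested grows quickly with W, so this brute-force form is only practical when few candidates need to be dropped.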
For all results in the paper we used the algorithm above; however, for some sampling techniques we can construct a simplified procedure. For n-hop snowball sampling, we can construct a list of possible seed nodes using the following procedure:

    listofSeeds = []
    OutsideNodes = list of nodes in (n+1)-hop snowball sample but not in n-hop sample
    for curNode in Subnetwork:
        if a node in OutsideNodes is in the n-hop snowball sample of curNode:
            continue
        else:
            add curNode to listofSeeds
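A runnable rendering of this simplified procedure is sketched below. It follows the pseudocode line by line and, like the pseudocode, reasons about nodes only. The adjacency-dictionary representation, the helper `ball` and the function name `possible_seeds_snowball` are assumptions made for the example.

```python
def ball(adj, nodes, n):
    """Return all nodes within n hops of any node in `nodes`."""
    seen = set(nodes)
    frontier = set(nodes)
    for _ in range(n):
        frontier = {v for u in frontier for v in adj[u]} - seen
        seen |= frontier
    return seen


def possible_seeds_snowball(adj, seeds, n):
    """List the subnetwork nodes whose own n-hop sample stays inside the subnetwork."""
    subnetwork = ball(adj, seeds, n)
    # Nodes in the (n+1)-hop sample but not in the n-hop sample.
    outside_nodes = ball(adj, seeds, n + 1) - subnetwork
    list_of_seeds = []
    for cur_node in sorted(subnetwork):
        if outside_nodes & ball(adj, {cur_node}, n):
            continue                      # cur_node would pull in an outside node
        list_of_seeds.append(cur_node)
    return list_of_seeds


# Toy path graph 0-1-2-3-4-5 (hypothetical data, for illustration only).
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
candidates = possible_seeds_snowball(adj, {2}, 2)
```

Note that the returned list includes the original seeds themselves, since their own samples lie inside the subnetwork by construction.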

Optimising Finding Redundant Seed Nodes
The algorithm presented in Section 2.4 of the paper is the following:
1. Remove each seed in turn and check whether the number of nodes and edges in the subnetwork changes. If neither changes, add the node to the list of redundant seeds.
2. Form a list of the remaining seeds.
6. Return the smallest seed list(s) that produce the same network.
The major problem in this procedure is the large number of options that may need to be checked to find the minimum seed list. As stated in the paper, finding the minimum seed list for snowball sampled networks can be converted into the set cover problem, which is NP-hard [3].
The set cover problem is defined as follows: we are given a set $F$ and a collection of subsets $\mathcal{F} = \{F_1, ..., F_m\}$ such that $F = \bigcup_{x \in \mathcal{F}} x$ [3]. The problem is then to find the smallest subset of $\mathcal{F}$, which we shall call $\mathcal{F}^*$, such that $F = \bigcup_{x \in \mathcal{F}^*} x$. We can reformulate the minimum seed list problem for snowball sampled networks as follows. We let $\mathcal{F}$ be an empty set. For each seed node we compute the set of nodes which are sampled by this seed node and we add it to $\mathcal{F}$. Finding the minimum seed list is then equivalent to finding $\mathcal{F}^*$ and then returning the seeds which were used to construct each element of $\mathcal{F}^*$.
Thus, in cases where we require additional speed, we could try reformulating this as a set cover problem and use state-of-the-art algorithms for this problem.
It may also be possible to convert the path based techniques into another related NP-complete or NP-hard problem and use a similar technique. However, as we do not require the speed for the work we are doing here, we have not attempted to do so.
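To make the set cover reformulation concrete, the sketch below applies the classical greedy heuristic to the per-seed node sets of a snowball sample. This is a swapped-in technique chosen for brevity: the greedy heuristic is an approximation and is not guaranteed to return a minimum seed list, and it is not claimed to be what the authors used. It considers node coverage only, and the names `ball` and `greedy_seed_cover` are hypothetical.

```python
def ball(adj, nodes, n):
    """Return all nodes within n hops of any node in `nodes`."""
    seen = set(nodes)
    frontier = set(nodes)
    for _ in range(n):
        frontier = {v for u in frontier for v in adj[u]} - seen
        seen |= frontier
    return seen


def greedy_seed_cover(adj, seeds, n):
    """Greedy set-cover heuristic; may return a non-minimum seed list."""
    # One subset per seed: the nodes sampled by that seed alone.
    cover = {x: ball(adj, {x}, n) for x in seeds}
    # The set to be covered: every node sampled by the full seed list.
    uncovered = set().union(*cover.values())
    chosen = []
    while uncovered:
        # Pick the seed covering the most still-uncovered nodes (ties: smallest label).
        best = max(sorted(cover), key=lambda x: len(cover[x] & uncovered))
        chosen.append(best)
        uncovered -= cover[best]
    return chosen


# Toy path graph 0-1-2-3-4-5 (hypothetical data, for illustration only).
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
reduced = greedy_seed_cover(adj, [0, 1, 2], 2)
```

An exact solver (for example an integer programming formulation) would be needed where a provably minimum seed list is required.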
Further Optimisations
A further optimisation, which is sampling technique dependent, can be performed on sampling techniques that scale with the number of nodes in the whole network and that only depend on the information in the subnetwork. Sampling in the subnetwork rather than in the wider network can be more efficient while still guaranteeing the result. One example of this procedure is shortest path sampling. All of the information about shortest paths is included in the subnetwork. Therefore sampling with the reduced seed list in the subnetwork saves time (as shortest path computation scales with the number of nodes and edges, depending on implementation) and guarantees that the result is correct as long as the seed list is a subset of the original seed list.

Further Results: Adding Redundant Seed Results
We showed that redundant seed nodes have to be taken into account; in particular, in Figure 4 of the paper we demonstrated that the significance of randomly chosen seed lists can be changed in the BioGRID network under 2-hop snowball sampling by increasing the size of the seed list without changing the resultant sampled network. A similar effect can be observed for several other sampling techniques, as can be seen in Figure 2.

Empirical Seed Lists: Additional Results
To test whether the results we see in the paper are robust with respect to the bins that we use to generate the random seed lists, we recalculate the results in the paper using minimum bin sizes of 5, 10, 20, 30 and 50. The smaller the minimum bin size, the more closely the degree sequence will match the test sequence, whereas larger minimum bin sizes produce seed lists whose degree sequences are further from the test list, but have a lower likelihood of selecting the same small set of seed nodes. Figure 3 shows the results for the OMIM seed list and Figure 4 shows the results for the expression seed list.

Comparison With Configuration Model
The configuration model may not preserve the structural features of the original network. For example, in the 2-hop snowball sample in Figure 5 there is a very clear structure, with a maximum path length of 4 between any pair of nodes. This structure will not be preserved in the configuration model. Fig. 6 shows a simple comparison between the distribution of shortest path lengths in the snowball sampled network and that in an ensemble of configuration models of the same network.
In the paper we examine the p-value distribution using our null model and the configuration model in 2-hop snowball sampled subnetworks. We repeated the comparison with all of the other sampling techniques using the method described in the main paper. The results can be seen in Figure 7. We use a χ2 test to compare the distributions with the uniform distribution, taking as observations the p-values of the statistic of interest from 1000 networks generated by selecting 25 random seeds (Table 3). We see that for all sampling techniques we reject the null hypothesis for the configuration model, and for Snow1, Snow2 and all shortest paths we do not reject the null hypothesis for our null model. For Path3 in clustering we reject the null hypothesis at the 5% level but not at the 1% level, and further visual inspection of the distribution does not raise any concern.
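As an illustration of this kind of uniformity check, the sketch below computes a Pearson χ2 statistic for a list of p-values against the uniform distribution on [0, 1]. The choice of 10 equal-width bins is an assumption made for the example (the binning is not stated here), as are the function name and the synthetic inputs; the quoted critical values are the standard ones for 9 degrees of freedom.

```python
def chi_squared_uniform(p_values, bins=10):
    """Pearson chi-squared statistic of p-values against Uniform(0, 1)."""
    counts = [0] * bins
    for p in p_values:
        counts[min(int(p * bins), bins - 1)] += 1   # p = 1.0 falls in the last bin
    expected = len(p_values) / bins
    return sum((c - expected) ** 2 / expected for c in counts)


# Critical values of the chi-squared distribution with bins - 1 = 9 degrees of freedom.
CRITICAL_5_PERCENT = 16.919
CRITICAL_1_PERCENT = 21.666

# Synthetic inputs, for illustration only: evenly spread vs. concentrated p-values.
even = [(i + 0.5) / 1000 for i in range(1000)]
skewed = [0.05] * 100

reject_even = chi_squared_uniform(even) > CRITICAL_5_PERCENT
reject_skewed = chi_squared_uniform(skewed) > CRITICAL_5_PERCENT
```

A statistic between the two critical values corresponds to the situation described for Path3 in clustering: rejection at the 5% level but not at the 1% level.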
In the case of the clustering in Path2, triangles can occur only when two seed nodes are less than or equal to 2 hops apart and one of them is part of a triangle; in a network with 8,292 nodes and an average shortest path length of 4.35 this is unlikely to happen. For non-continuous distributions, rather than the uniform distribution, we would expect to see the generalised inverse of the cumulative distribution function as the distribution of the p-values. Fig. 8 shows that the distribution of average local clustering coefficients for Path2 networks sampled with 25 random seeds is indeed discontinuous, and therefore we should not be surprised to see a non-uniform null distribution. In contrast, the same distribution from Snow1, shown in Fig. 9, is approximately continuous, and therefore the uniform distribution appears as the null distribution, as expected.
While we cannot generalise from these results to all possible network ensembles, and it is highly likely that there are network models and parameter ranges where the configuration model performs well in subnetworks, the configuration model does not perform well in general when comparing subnetworks based on seed lists. This demonstrates the need for an alternative to the configuration model for this task.

The seed list consists of node 1 (circle). Node shape represents distance from the seed protein: squares are nodes 1 hop from the seed, diamonds are 2 hops from a seed, and triangles are 3 hops from a seed. Dashed edges represent cross-edges in a 2-hop snowball sample.