Dimension reduction for covariates in network data

A problem of major interest in network data analysis is to explain the strength of connections using contextual information. To achieve this, we introduce a novel approach, called network-supervised dimension reduction, in which covariates are projected onto low-dimensional spaces to reveal the linkage pattern without assuming a model. We propose a new loss function for estimating the parameters in the resulting linear projection, based on the notion that closer proximity in the low-dimensional projection corresponds to stronger connections. Interestingly, the convergence rate of our estimator is found to depend on a network effect factor, the smallest number of groups into which the node pairs of a graph can be partitioned, in a manner similar to the graph colouring problem. Our method has interesting connections to principal component analysis and linear discriminant analysis, which we exploit for clustering and community detection. The proposed approach is further illustrated by numerical experiments and analysis of a pulsar candidates dataset from astronomy.


1. Introduction
Network data that include multiple objects with measurements on the interactions between pairs of objects are becoming increasingly common in a wide variety of fields (Holland & Leinhardt, 1981; Wolfe, 1997; Jin et al., 2001; Newman et al., 2002; Watts et al., 2002; Newman & Park, 2003; Sarkar & Moore, 2005; Newman, 2006; Hunter et al., 2008; Kolaczyk, 2009; Goldenberg et al., 2010; Fienberg, 2012; Scott, 2017). The topology of a network is often represented as a graph, denoted by $G = (V, E)$, where $V = \{1, 2, \ldots, n\}$ is the set of $n$ nodes and $E$ is the set of edges between the nodes. The relationships between nodes can be described by an adjacency matrix $W = (w_{ij}) \in \mathbb{R}^{n \times n}$, where $w_{ij}$ is some measure of the connection strength between nodes $i$ and $j$.
Given pairwise covariates $X_{ij}$ and a dissimilarity measure $s_{ij}$ derived from $w_{ij}$, we seek a projection matrix $B$ such that the smaller $B^{\mathrm{T}} X_{ij}$ is, the smaller $s_{ij}$ is. To this end, we propose a novel estimator of $B$ based on a new loss function and study its rate of convergence for approximating the columns of $B$ in terms of $\ell_2$ distance. This is achieved without imposing a restrictive independence assumption on the $w_{ij}$, or needing to assume a link function between $B^{\mathrm{T}} X_{ij}$ and $s_{ij}$. We show that the convergence rate of the projection depends critically on a factor referred to as the network effect of a graph, which is closely related to the graph colouring problem. Proposing such an estimator and characterizing its properties can be seen as the first contribution of this work. Our second contribution is to establish a natural connection between our method and existing methods such as principal component analysis and linear discriminant analysis. The connection to the latter enables us to leverage covariate information for better community detection, which we illustrate via simulations showing that a clustering algorithm based on network-supervised dimension reduction outperforms its competitors.
The following notation is used throughout this paper. For any matrix $A = (a_{ij}) \in \mathbb{R}^{p \times p}$, $\|A\|_{\mathrm{op}}$ and $\|A\|_{\mathrm{F}}$ denote its operator norm and Frobenius norm, respectively, and $\|A\|_{\max} = \max_{i,j} |a_{ij}|$. Let $\{a_{ij}\}$ be the set of entries of $A$. For any symmetric matrix $A$, $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ stand for the maximum and minimum eigenvalues of $A$, respectively, and $\mathrm{tr}(A)$ denotes the trace of $A$. For a vector $v \in \mathbb{R}^p$, $\|v\|$ denotes its $\ell_2$-norm. For any variable $Z \in \mathbb{R}$, define $\|Z\|_{\psi_2} = \sup_{p \geq 1} p^{-1/2} \{E(|Z|^p)\}^{1/p}$, and for any $Z \in \mathbb{R}^p$, define $\|Z\|_{\psi_2} = \sup_{x \in S^{p-1}} \|\langle Z, x \rangle\|_{\psi_2}$, where $S^{p-1}$ is the unit sphere in $\mathbb{R}^p$. We use $I_n$ to denote the $n \times n$ identity matrix. For any set $V$, $|V|$ denotes its cardinality. For any matrix $B$, we denote by $\mathrm{span}(B)$ the space spanned by the columns of $B$ and let $P_B$ be the projection matrix onto $\mathrm{span}(B)$. We write $a \wedge b = \min\{a, b\}$.

2. Network-supervised dimension reduction
2.1. Notation and background

Recall that our data consist of network-covariate tuples $\{s_{ij}, X_{ij}\}$ $(i \neq j)$. Our goal is to find $B \in \mathbb{R}^{p \times r}$ such that a small value of $B^{\mathrm{T}} X_{ij}$ corresponds to a small value of $s_{ij}$. We refer to $B$ as the projection matrix and its columns as the projection directions. To partially ensure identifiability of $B$, we constrain $B \in \mathbb{B}_{r,A}$, where $\mathbb{B}_{r,A} \subset \mathbb{R}^{p \times r}$ satisfies $\mathbb{B}_{r,A} = \{B \in \mathbb{R}^{p \times r} : B^{\mathrm{T}} A B = I_r\}$ for a symmetric positive-definite matrix $A \in \mathbb{R}^{p \times p}$ with eigenvalues uniformly bounded away from 0 and $\infty$. An obvious example is $A = I_p$. Since a small value of $B^{\mathrm{T}} X_{ij}$ corresponds to a small value of $s_{ij}$, our proposed network-supervised dimension reduction estimates $B$ as
$$\hat{B}_{r,A} = (\hat{\beta}_{A,1}, \ldots, \hat{\beta}_{A,r}) = \arg\max_{B \in \mathbb{B}_{r,A}} H(B), \tag{1}$$
where $H(B) = \{n(n-1)\}^{-1} \sum_{i \neq j} s_{ij} \|X_{ij}^{\mathrm{T}} B\|^2 = \mathrm{tr}(B^{\mathrm{T}} \hat{G} B)$ with $\hat{G} = \{n(n-1)\}^{-1} \sum_{i \neq j} s_{ij} X_{ij} X_{ij}^{\mathrm{T}}$. This optimization problem for estimating the projection directions requires only a standard eigenvalue decomposition, as shown in the following proposition.
Proposition 1. Suppose that all the eigenvalues of $A^{-1/2} \hat{G} A^{-1/2}$ are distinct. Let $\hat{\Gamma}_r$ be the matrix consisting of the eigenvectors associated with the $r$ largest eigenvalues of $A^{-1/2} \hat{G} A^{-1/2}$. Then $\mathrm{span}(\hat{B}_{r,A}) = \mathrm{span}(A^{-1/2} \hat{\Gamma}_r)$.
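To make Proposition 1 concrete, the following minimal Python sketch computes $\hat{B}_{r,A} = A^{-1/2} \hat{\Gamma}_r$ from pairwise responses and covariates; the function name nsdr_fit, the dense-array representation of $\{X_{ij}\}$, and the use of numpy are our own illustrative choices rather than part of the paper.

```python
import numpy as np

def nsdr_fit(S, X_pair, A, r):
    """Minimal sketch of network-supervised dimension reduction.

    S      : (n, n) array of responses s_ij (diagonal ignored)
    X_pair : (n, n, p) array of pair covariates X_ij
    A      : (p, p) symmetric positive-definite constraint matrix
    r      : number of projection directions
    Returns B_hat with span(B_hat) = span(A^{-1/2} Gamma_hat_r).
    """
    n, _, p = X_pair.shape
    # G_hat = {n(n-1)}^{-1} sum_{i != j} s_ij X_ij X_ij^T, as in (1)
    G = np.zeros((p, p))
    for i in range(n):
        for j in range(n):
            if i != j:
                x = X_pair[i, j]
                G += S[i, j] * np.outer(x, x)
    G /= n * (n - 1)
    # Whiten by A^{-1/2} and take the r leading eigenvectors
    w, V = np.linalg.eigh(A)
    A_inv_half = V @ np.diag(w ** -0.5) @ V.T
    eigval, eigvec = np.linalg.eigh(A_inv_half @ G @ A_inv_half)  # ascending order
    Gamma_r = eigvec[:, ::-1][:, :r]
    return A_inv_half @ Gamma_r  # B_hat = A^{-1/2} Gamma_hat_r
```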
While $\mathrm{span}(\hat{B}_{r,A})$ is unique, $\hat{B}_{r,A}$ itself is not; Proposition 1 suggests that we can take $\hat{B}_{r,A} = A^{-1/2} \hat{\Gamma}_r$. We next provide analogous results at the population level. Let $G_{0n} = E(\hat{G})$ be the expectation of $\hat{G}$, which may depend on the size of the network $n$, and assume that $G_0 = \lim_n G_{0n}$ for some $G_0 \in \mathbb{R}^{p \times p}$. When the $Z_{ij}$ have the same distribution, but are not necessarily independent, it can be seen that
$$B_{r,A} = (\beta_{A,1}, \ldots, \beta_{A,r}) = \arg\max_{B \in \mathbb{B}_{r,A}} \mathrm{tr}(B^{\mathrm{T}} G_0 B), \tag{2}$$
which is the population version of $\hat{B}_{r,A}$. As in Proposition 1, assuming that the eigenvalues of $A^{-1/2} G_0 A^{-1/2}$ are distinct, if we write $\Gamma_r$ for the matrix consisting of the eigenvectors associated with the $r$ largest eigenvalues of $A^{-1/2} G_0 A^{-1/2}$, then we also have $\mathrm{span}(B_{r,A}) = \mathrm{span}(A^{-1/2} \Gamma_r)$. By an argument similar to that used for $\hat{B}_{r,A}$, we simply set $B_{r,A} = A^{-1/2} \Gamma_r$. We now state sufficient conditions which guarantee that the population maximizer of $H(B)$ spans the same column space as spanned by the true projection directions. Letting the matrix $A$ in $\mathbb{B}_{r,A}$ be $A = E(X_{ij} X_{ij}^{\mathrm{T}})$, which equals $\mathrm{cov}(X_{ij})$ when $E(X_{ij}) = 0$, we have the following result.
Proposition 2. Suppose that $\{(s_{ij}, X_{ij}), i \neq j\}$ are identically distributed and that conditions (i)-(iii) hold, where condition (i) is that $E(s_{ij} \mid X_{ij}) = h(B_0^{\mathrm{T}} X_{ij})$ for some $B_0 \in \mathbb{R}^{p \times r}$ and an unknown link function $h$. Then $\mathrm{span}(B_{r,A}) = \mathrm{span}(B_0)$.

This proposition requires that the conditional mean of $s_{ij}$ depend on $X_{ij}$ only through the linear combination $B_0^{\mathrm{T}} X_{ij}$, with an unknown link function $h$ that is left unspecified. This is reminiscent of the assumption in the literature on sufficient dimension reduction (Li, 1991), especially for making inference about the conditional mean of the response given the predictors (Cook & Li, 2002). The key difference is that the responses in our set-up are typically correlated owing to the existence of the network structure. Condition (i) can be seen as the true model. In particular, when $s_{ij} \in \{0, 1\}$, for example $s_{ij} = 1 - w_{ij}$, this condition says that $\mathrm{pr}(s_{ij} = 1) = h(B_0^{\mathrm{T}} X_{ij})$. The estimation procedure in (1) does not provide an estimator of $h$; as such, our estimation procedure is model-free. To understand condition (iii), consider the case where the covariates are defined as $X_{ij} = X_i - X_j$ with $X_i \sim N(\mu, \Sigma)$. Then this assumption becomes $\mathrm{cov}\{s_{ij}, (X_{ij}^{\mathrm{T}} \beta_m)^2\} > 0$, as shown in the Supplementary Material; it is intuitive, since we expect a smaller $s_{ij}$ to correspond to a small value of $B_0^{\mathrm{T}} X_{ij}$ and therefore a small value of $(X_{ij}^{\mathrm{T}} \beta_m)^2$.

2.2. Connections to other methods
In the context of the so-called stochastic block model, we establish connections between our dimension reduction method and principal component analysis as well as linear discriminant analysis. The latter two methods are widely used statistical tools for reducing the dimensionality of data, and both work by finding the best linear combinations of covariates. Principal component analysis is an unsupervised method that projects observations onto the so-called principal component directions in such a way that the variance of the projected data is maximized. Linear discriminant analysis is a supervised learning algorithm that finds the so-called linear discriminant directions for projecting data to maximize the separation between observations belonging to different groups (Johnson & Wichern, 2019).
Recall that for a stochastic block model with $k$ communities, each node belongs to a latent community (Holland et al., 1983). Denote the latent community label of the $i$th node by $C_i$, where $C_i \in \{1, \ldots, k\}$ for $i = 1, \ldots, n$. The stochastic block model assumes that these community labels are independent and identically distributed random variables such that $\mathrm{pr}(C_i = t) = \pi_t$ $(t = 1, \ldots, k)$, where the $\pi_t$ are unknown parameters satisfying $\sum_{t=1}^{k} \pi_t = 1$. Given their respective communities, nodes $i$ and $j$ make a connection with probability $\mathrm{pr}(w_{ij} = 1 \mid C_i, C_j) = p_{C_i C_j}$ $(i \neq j)$, independently of all other pairs, where $p_{C_i C_j}$ is a parameter that depends only on $C_i$ and $C_j$. We look at a simplified stochastic block model where $p_{C_i C_j} = a_t$ for $C_i = C_j = t$ and $p_{C_i C_j} = b$ for any $C_i \neq C_j$; that is, all the probabilities of intercommunity connections are the same. For the covariates, we take $X_{ij} = X_i - X_j$, where the covariate vector for node $i$ satisfies
$$X_i = \mu_{C_i} + \epsilon_i \tag{3}$$
for independent and identically distributed random variables $\epsilon_i$ with $E(\epsilon_i) = 0$ and $\mathrm{cov}(\epsilon_i) = \Sigma$.
Here it is assumed that $\epsilon_i$ is independent of $C_i$, the latent community label of node $i$ in the stochastic block model above. That is, the covariates follow a distribution with a common covariance matrix and a community-specific mean. In such a setting, if $s_{ij}$ is a one-to-one mapping of $w_{ij}$, it is easily seen that $E(s_{ij} \mid C_i = t, C_j = t')$ is a constant, depending on $b$, for any $t \neq t'$, which will be denoted by $\gamma_b$ hereafter. Write $E(s_{ij} \mid C_i = C_j = t) = \gamma_t$ $(t = 1, \ldots, k)$ for ease of notation. We point out that the $w_{ij}$ in the above model depend only on the labels $C_i$, which differs from the model assumed in condition (i) of Proposition 2.
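This data-generating process is easy to simulate. The sketch below draws the simplified stochastic block model together with covariates from model (3); the Gaussian choice for $\epsilon_i$ and the function name simulate_sbm_with_covariates are illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_sbm_with_covariates(n, pi, a, b, mu, Sigma):
    """Simplified stochastic block model plus covariate model (3):
    X_i = mu_{C_i} + eps_i with eps_i ~ N(0, Sigma) (a Gaussian
    choice made here for illustration only).

    pi : (k,) community proportions; a : (k,) within-community
    probabilities a_t; b : common between-community probability;
    mu : (k, p) community means; Sigma : (p, p) covariance.
    """
    mu, Sigma = np.asarray(mu), np.asarray(Sigma)
    k = len(pi)
    C = rng.choice(k, size=n, p=pi)                      # latent labels
    # p_{C_i C_j} = a_t if C_i = C_j = t, and b otherwise
    P = np.where(C[:, None] == C[None, :], np.asarray(a)[C][:, None], b)
    U = rng.random((n, n))
    W = np.triu((U < P).astype(int), 1)
    W = W + W.T                                          # symmetric, no self-loops
    X = mu[C] + rng.multivariate_normal(np.zeros(mu.shape[1]), Sigma, n)
    return W, X, C
```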
If we apply principal component analysis to the nodal feature $X_i$, at the population level the principal component directions are the leading eigenvectors of $\mathrm{cov}(X_i)$ corresponding to its largest eigenvalues. If we apply linear discriminant analysis to the labelled data $\{C_i, X_i\}_{i=1}^{n}$, assuming that the latent community labels in model (3) are known, the linear discriminant directions at the population level are the leading $k - 1$ eigenvectors of the generalized eigenvalue problem that solves $\Sigma_{\mathrm{bt}} U = \lambda \Sigma U$ for $U \in \mathbb{R}^{p \times (k-1)}$, where $\Sigma_{\mathrm{bt}} = k^{-1} \sum_{t=1}^{k} (\mu_t - \bar{\mu})(\mu_t - \bar{\mu})^{\mathrm{T}}$ with $\bar{\mu} = (\sum_{t=1}^{k} \mu_t)/k$, and $\Sigma$ is the covariance matrix of $\epsilon_i$ defined above. We have the following result that connects our approach with principal component analysis and linear discriminant analysis.
Proposition 3. Assume that $W = (w_{ij})$ is generated from the simple stochastic block model outlined above and that the $X_i$ are generated from model (3). If all the eigenvalues of $A^{-1/2} G_0 A^{-1/2}$ are distinct for $A$ as specified below, the following conclusions hold.
(i) With $A = I_p$ in $\mathbb{B}_{r,A}$, our approach is equivalent to principal component analysis conducted as an eigenvalue decomposition of $\mathrm{cov}(X_i)$ at the population level, in the sense that $B_{r,A}$ consists of exactly the eigenvectors associated with the $r$ largest eigenvalues of $\mathrm{cov}(X_i)$, if and only if

This proposition shows that network-supervised dimension reduction can be equivalent to unsupervised principal component analysis or supervised linear discriminant analysis, depending on the data-generating process. In Proposition 3, it is assumed that $p_{C_i C_j} = b$ for any $C_i \neq C_j$. By checking the proof, one can see that the proposed method may not be equivalent to principal component or linear discriminant analysis in general when this assumption does not hold. In this case, the associated objective function for our method can be viewed as a generalized version of those in principal component analysis and linear discriminant analysis. For community detection, we explain what we mean by further examining the special case of two communities, where $k = 2$ and $s_{ij}$ is a linear decreasing function of $w_{ij}$. Recall the definition of $B_{r,A}$ in (2).
Corollary 1. Assume that $X_{ij}$ and $W = (w_{ij})$ are generated as in Proposition 3. Let $s_{ij} = \alpha_0 - \alpha_1 w_{ij}$, with $\alpha_1 > 0$ and $\alpha_0 \in \mathbb{R}$, be a linear decreasing function of $w_{ij}$. Suppose that the eigenvalues of $A^{-1/2} G_0 A^{-1/2}$, for $A$ as specified below, are distinct. The following conclusions hold.

(iii) With $A = \Sigma$, if $\alpha_0 \geq \alpha_1 > 0$ and $b < 1$, the solution of network-supervised dimension reduction is equivalent to that of linear discriminant analysis for the model (3) at the population level.
To understand this corollary, assume for simplicity that the two communities are of equal size, so that $\pi_1 = \pi_2 = 1/2$. In this case, (i) states the equivalence of the proposed approach and principal component analysis if and only if $b = (a_1 + a_2)/2 < \alpha_0/\alpha_1$, which simplifies to $b = (a_1 + a_2)/2$ when $\alpha_0 > \alpha_1$, because $b \leq 1$. In other words, when $b = (a_1 + a_2)/2$, the network information in terms of the adjacency matrix $W$ does not contribute to identification of the projections. This is reasonable, since in that case the probabilities of connections between different communities are not small. When $\pi_1 = \pi_2 = 1/2$, the condition in (ii) becomes $(a_1 + a_2)/2 > b$, and (ii) asserts the equivalence of the proposed approach and linear discriminant analysis when $A = \mathrm{cov}(X)$ and the connection probabilities between different communities are small. The assumption $(a_1 + a_2)/2 > b$ is weaker than strong and weak assortativity (Amini & Levina, 2018), which require $\min\{a_1, a_2\} > b$ in the setting above. In (iii), when $A = \Sigma$, our approach is equivalent to linear discriminant analysis for any linear decreasing function when $\alpha_0 \geq \alpha_1 > 0$ and $b < 1$. When we apply our method to community detection, we show in the Supplementary Material that the misclassification error depends on the connection probabilities mainly through the projected directions at the population level. Notably, $a_1$, $a_2$ and $b$ satisfying the constraints in (ii) and (iii) can be small, implying that our approach is applicable to sparse networks. Relevant simulation results are presented in § 2.3.
The results in Proposition 3 and Corollary 1 suggest that the choice of $A$ matters in making connections to principal component analysis and linear discriminant analysis. When $A$ is not prespecified, we suggest taking $A = \mathrm{cov}(X)$. In practice, when $A$ is estimated by $\hat{A}$ from data, there will be an error between $\hat{B}_{r,\hat{A}}$ and its population version $B_{r,A}$. We give a bound on this error in the Supplementary Material.

2.3. Application to community detection
Motivated by the covariate model in (3), we use network-supervised dimension reduction for community detection. To proceed, we first estimate the projection directions, denoted by $\hat{B}_{r,A}$ if $A$ is given or $\hat{B}_{r,\hat{A}}$ if $A$ is estimated as $\hat{A}$. We can then apply a clustering method to the projected observations $\hat{B}_{r,A}^{\mathrm{T}} X_i$ or $\hat{B}_{r,\hat{A}}^{\mathrm{T}} X_i$ $(i = 1, \ldots, n)$. For illustration, we apply $K$-means clustering in the second step. In practice, to check whether our method is applicable, one can examine the scree plot of the clustering algorithm by plotting the ratio of within-cluster variance to total variance against the number of clusters. If there is a clear gap in the plot, it can be inferred that there is community or cluster structure and our method is applicable.
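The two-step procedure can be summarized in a short sketch, shown below under the choices used later in this section ($s_{ij} = 1 - w_{ij}$, $X_{ij} = X_i - X_j$, and $A$ estimated by the sample covariance matrix of $X$); the helper name nsdr_communities is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def nsdr_communities(W, X, r, k):
    """Sketch: estimate the projection directions with s_ij = 1 - w_ij
    and X_ij = X_i - X_j, then run K-means on the projected nodes."""
    n, p = X.shape
    S = 1.0 - W
    # G_hat as in (1)
    G = np.zeros((p, p))
    for i in range(n):
        for j in range(n):
            if i != j:
                x = X[i] - X[j]
                G += S[i, j] * np.outer(x, x)
    G /= n * (n - 1)
    A = np.cov(X, rowvar=False)                    # A_hat = sample cov(X)
    w_, V = np.linalg.eigh(A)
    A_inv_half = V @ np.diag(w_ ** -0.5) @ V.T
    _, vec = np.linalg.eigh(A_inv_half @ G @ A_inv_half)
    B = A_inv_half @ vec[:, ::-1][:, :r]           # B_hat_{r, A_hat}
    return KMeans(n_clusters=k, n_init=10).fit_predict(X @ B)
```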
We now present the results of a small numerical experiment to evaluate the performance of the community detection method based on our approach. The data are generated such that $W$ follows the simplified stochastic block model for the edges, with two communities, as in § 2.2, and $X$ follows the covariate model in (3), with $\epsilon_i$ being a multivariate normal random vector. The community means in (3) are set to depend on a separation parameter $u$; it is understood that as $u$ increases, the data in the two communities are better separated by the covariates. In the stochastic block model, we set $\pi_1 = \pi_2 = 1/2$ and $(a_1, a_2, b) = \delta_0 (0.5, \tau, 0.1)$, where $\delta_0$ and $\tau$ are constants in $[0, 1]$. A smaller $\delta_0$ gives a sparser network, and a smaller value of $\tau$ gives a weaker community structure in the second group. When $\tau$ is close to 0.1, the signal of the second group is weak and detecting it is difficult.
By varying the magnitudes of $u$, $\delta_0$ and $\tau$, we evaluate how the proposed method performs with respect to the informativeness of the covariates, the sparsity of the network and the strength of the community structure. In particular, we consider the following three scenarios:

We examine two choices of $A$ for estimating $B_{r,A}$. The first is $A = \mathrm{cov}(X)$, which is estimated by the sample covariance matrix of $X$. The second is $A = \Sigma$, the corresponding algorithm and simulation results for which are given in the Supplementary Material. These two choices of $A$ yield similar results. To apply network-supervised dimension reduction, the response variable is taken to be $s_{ij} = 1 - w_{ij}$ and the number of directions is set to $r = 1$. Each time, we generate a dataset with $n = 100$, and we repeat the process 100 times. The performance of an approach is evaluated by calculating its clustering error, defined as the proportion of nodes that are misclassified. The performance of $K$-means clustering after applying our approach is compared with that of standard $K$-means clustering, which uses only covariate information, as well as with several competing methods that use information from both the network and the covariates, including the methods of Zhang et al. (2016), Binkiewicz et al. (2017) and Yan & Sarkar (2020). Moreover, in the Supplementary Material we consider cases where the network is dense, in that $\delta_0$ is relatively large and the signal of the second community is strong, and we also report the performance of the method of Huang & Feng (2020) and the spectral clustering method of Rohe et al. (2011). The clustering errors for these methods are presented in Fig. 1 and in the Supplementary Material.
It is seen that for $K$-means clustering, the clustering error decreases as $u$ increases. The performance of the methods of Zhang et al. (2016) and Binkiewicz et al. (2017) improves when the network becomes dense, but since detecting the second group is difficult when $\tau = 0.1$, i.e., $a_2 = b$, both methods perform worse than the method of Yan & Sarkar (2020) and our method, as shown in Fig. 1(c). In addition, we see that the method of Yan & Sarkar (2020) performs better than the methods of Zhang et al. (2016) and Binkiewicz et al. (2017) in most cases, but worse than our method. Overall, clustering based on our approach performs much better than the competing methods, and it is rather insensitive to the parameters $u$, $\delta_0$ and $\tau$. This implies that our approach can exploit the information in the covariates as well as the network structure.
In particular, when $\tau = 0.1$ it is quite difficult to detect the second group; but with the help of the information from the covariates, $K$-means clustering based on our approach still estimates the community structure well. To gain more insight, we consider a simple case where $r = 1$, $C_i \in \{1, 2\}$ and $\epsilon_i$ follows a normal distribution, and give an explicit bound on the classification error rate in the Supplementary Material. Finally, we also observe that the performance of our method can deteriorate for a network whose between-community connection probabilities dominate the within-community ones, or for a network with a core-periphery structure.

3. Asymptotics
We study the statistical properties of $\hat{G}$ defined in (1) as an estimator of its population version $G_0 = \lim_n E(\hat{G})$. Because of the network structure, each $w_{ij}$ in the adjacency matrix $W$ may be affected by the other off-diagonal entries of $W$ in complex ways, which poses great challenges for theoretical analysis. We impose assumptions that rule out the cases where all entries of $W$ can be strongly dependent, without explicitly modelling the dependence structure of the edges of the network.
We motivate our assumptions by generalizing an idea for inducing edge dependence that is widely used in the graphon model (Lovász & Szegedy, 2006; Diaconis & Janson, 2008; Bickel & Chen, 2009), and we will follow the notation in Gao et al. (2015). For an undirected graph, the graphon model assumes that the edge random variables satisfy $w_{ij} \sim \mathrm{Ber}\{f(\xi_i, \xi_j)\}$ $(i < j)$. The sequence $\{\xi_i\}$ consists of independent and identically distributed latent random variables from the uniform distribution on $[0, 1]$, and given $\{\xi_i\}$, the $w_{ij}$ are independent for $i < j$. The function $f$, a bivariate function that is symmetric in its arguments, is called a graphon. In the graphon model, because the $i$th latent variable $\xi_i$ is assumed to be associated with the $i$th node, two edge random variables $w_{ij}$ and $w_{kl}$ are independent as long as they do not share a common node index.
We now introduce what we call the generalized graphon model, which is useful for characterizing the dependence structure in our set-up. Assume that $\xi_i$ $(i = 1, \ldots, n)$ and $\zeta_j$ $(j = 1, \ldots, n)$ are independent and identically distributed latent random variables. Let $\Xi = (\xi_1, \ldots, \xi_n)^{\mathrm{T}} \in \mathbb{R}^n$. Instead of associating a single element $\xi_i$ of $\Xi$ with node $i$ as in the graphon model, we associate with node $i$ a subset of $\Xi$ for introducing dependence, as well as an independent $\zeta_i$ for a node-specific effect. Denote the subset for node $i$ by $N_i = \{j : \xi_j \text{ is associated with node } i\}$. In the graphon model, $i \in N_i$. We then assume the edge random variable $w_{ij} \sim \mathrm{Ber}(\theta_{ij})$, where $\theta_{ij} = f_{ij}(\Xi_{N_i}, \Xi_{N_j}, \zeta_i, \zeta_j)$. Here $\Xi_{N_i}$ is the subvector of $\Xi$ with indices in $N_i$. In our construction, we have purposely left unspecified the exact distributions of the random variables $\{\xi_i\}$ and $\{\zeta_j\}$, as well as the functions $\{f_{ij}\}$, as we only need this general construction for relating the edge random variables. In the special case of the graphon model, $N_i = \{i\}$. Let $\mathcal{V}$ be the set of pairs of node pairs such that any two pairs do not share common latent random variables. It is clear by construction that for any $\{(i, j), (k, t)\} \in \mathcal{V}$, $w_{ij}$ is independent of $w_{kt}$ given $X_{ij}$ and $X_{kt}$. The cardinality of $\mathcal{V}$ provides a rough characterization of the dependence structure of a network and can be bounded accordingly. We now present another example, where $N_i = \{i, i+1\}$ for $i < n$ and $N_n = \{n, 1\}$; that is, we associate with each node two latent random variables in $\Xi$. If we represent this example by a graph in which the nodes are $\{1, \ldots, n\}$ and an edge exists between nodes $i$ and $j$ if $N_i \cap N_j \neq \emptyset$, then it forms a cycle graph. In this example, it is not difficult to see that $|\mathcal{V}| = n(n-5)(n^2 - 9n + 22) = O(n^4)$, which is of the same order as the maximum possible cardinality of $|\mathcal{V}|$.
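For concreteness, here is a small simulation sketch of the generalized graphon model in the cycle example $N_i = \{i, i+1\}$; the uniform latent distributions and the particular function $f$ are arbitrary illustrative choices, since the construction leaves them unspecified.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_generalized_graphon(n, f):
    """Cycle example: N_i = {i, i+1 (mod n)}, so each node carries two
    latent variables; theta_ij = f(Xi_{N_i}, Xi_{N_j}, zeta_i, zeta_j)
    and w_ij ~ Ber(theta_ij)."""
    xi = rng.random(n)                      # latent vector Xi
    zeta = rng.random(n)                    # node-specific effects
    N = [np.array([i, (i + 1) % n]) for i in range(n)]
    W = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            theta = f(xi[N[i]], xi[N[j]], zeta[i], zeta[j])
            W[i, j] = W[j, i] = int(rng.random() < theta)
    return W

# An arbitrary choice of f_ij, constant over (i, j) and valued in [0, 1]
W = simulate_generalized_graphon(50, lambda a, b, zi, zj: (a.mean() + b.mean()) / 2)
```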
Next we study $\|\hat{G} - G_0\|_{\mathrm{op}}$. Establishing the rate of convergence of $\hat{G}$ in the operator norm is challenging because of the dependence between the nodes. In the generalized graphon model above, for example, node $i$ is correlated with node $j$ for any $j \in N_i$, which complicates the theoretical analysis. We overcome the dependency challenge by splitting the node pairs into groups such that any two node pairs in the same group are conditionally independent given covariates.
Let $\{\sigma(1), \ldots, \sigma(n)\}$ be any permutation of $\{1, \ldots, n\}$, and let $\xi_{\alpha,ij} = s_{ij} (\alpha^{\mathrm{T}} X_{ij})^2 - E\{s_{ij} (\alpha^{\mathrm{T}} X_{ij})^2\}$ for any given $\alpha \in \mathbb{R}^p$ satisfying $\|\alpha\| = 1$. Suppose that we split the index pairs $\{\bar{\sigma}(i) = \{\sigma(2i-1), \sigma(2i)\}, i = 1, \ldots, n/2\}$ into $m$ groups $G_1, \ldots, G_m$ such that any two pairs $\bar{\sigma}(i)$ and $\bar{\sigma}(j)$ within the same group satisfy $\{\bar{\sigma}(i), \bar{\sigma}(j)\} \in \mathcal{V}$. That is, given $\{X_{ij}\}$, different $\xi_{\alpha,ij}$ with their $(i, j)$ in the same group are independent; this will be referred to as the conditional independence property hereafter. It is shown in the Supplementary Material that a smaller $m$ is desired, as it leads to a tighter upper bound. Finding the smallest $m$ associated with the permutation $\{\sigma(1), \ldots, \sigma(n)\}$ is very challenging and can be viewed as a graph colouring problem, where interest often lies in finding the chromatic number of the graph, defined as the minimum number of colours required for a vertex colouring scheme in which any two adjacent vertices are coloured differently; see the Supplementary Material for further discussion. Denote this number by $m_\sigma$ and define $m_{\mathrm{net}} = \max_{\{\sigma(1),\ldots,\sigma(n)\}} m_\sigma$, which can be loosely seen as the network effect. The asymptotic behaviour of $\|\hat{G} - G_0\|_{\mathrm{op}}$ is presented in Theorem 1 below.
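Although computing $m_\sigma$ exactly is as hard as graph colouring, a greedy colouring gives a usable upper bound. The sketch below, in which the conflict test "two pairs share a latent variable" and all names are our own illustrative choices, groups the pairs $\bar{\sigma}(i)$ accordingly.

```python
import numpy as np

def greedy_group_count(pairs, N):
    """Greedy-colouring sketch: assign node pairs to groups so that two
    pairs in the same group share no latent variables; the number of
    groups used is an upper bound on m_sigma.

    pairs : list of node pairs (i, j); N[i] : set of latent indices of node i.
    """
    latent = [set(N[i]) | set(N[j]) for (i, j) in pairs]
    colours = []
    for a, La in enumerate(latent):
        used = {colours[b] for b in range(a) if latent[b] & La}
        c = 0
        while c in used:
            c += 1
        colours.append(c)
    return max(colours) + 1

# Cycle example N_i = {i, i+1 (mod n)} with pairs from a random permutation
n = 20
N = [{i, (i + 1) % n} for i in range(n)]
perm = np.random.default_rng(2).permutation(n)
pairs = [(perm[2 * i], perm[2 * i + 1]) for i in range(n // 2)]
print(greedy_group_count(pairs, N))  # number of groups m for this permutation
```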
Moreover, we study the asymptotic properties of the eigenvalues and eigenvectors of $\hat{G}$. For this, we write the eigenvalue decompositions of $G_0$ and $\hat{G}$ as $G_0 = \sum_{i=1}^{p} \lambda_i v_i v_i^{\mathrm{T}}$ and $\hat{G} = \sum_{i=1}^{p} \hat{\lambda}_i \hat{v}_i \hat{v}_i^{\mathrm{T}}$, respectively, where $\lambda_1 \geq \cdots \geq \lambda_p$ and $\hat{\lambda}_1 \geq \cdots \geq \hat{\lambda}_p$ are the eigenvalues, and $v_i$ and $\hat{v}_i$ the associated eigenvectors. The eigenvalues and eigenvectors depend on $p$, but we omit $p$ hereafter for simplicity. Similarly, let $G_{0A} = A^{-1/2} G_0 A^{-1/2}$ and $\hat{G}_A = A^{-1/2} \hat{G} A^{-1/2}$, where $\phi^A_1 \geq \cdots \geq \phi^A_p$ and $\hat{\phi}^A_1 \geq \cdots \geq \hat{\phi}^A_p$ are the eigenvalues, and $\varphi^A_i$ and $\hat{\varphi}^A_i$ the associated eigenvectors. Recall the definitions of $B_{r,A}$ and $\hat{B}_{r,A}$ in § 2. By Proposition 1, we see that
$$\hat{B}_{r,A} = A^{-1/2} (\hat{\varphi}^A_1, \ldots, \hat{\varphi}^A_r) \tag{4}$$
and that $B_{r,A} = A^{-1/2} (\varphi^A_1, \ldots, \varphi^A_r)$. When $A$ is unknown and estimated as $\hat{A}$, we can define $\hat{G}_{\hat{A}}$ and $\hat{\varphi}^{\hat{A}}_i$ analogously and estimate $B_{r,A}$ by $\hat{B}_{r,\hat{A}} = (\hat{\beta}_{\hat{A},1}, \ldots, \hat{\beta}_{\hat{A},r}) = \hat{A}^{-1/2} (\hat{\varphi}^{\hat{A}}_1, \ldots, \hat{\varphi}^{\hat{A}}_r)$.
To study the properties of $\hat{B}_{r,A}$ and $\hat{B}_{r,\hat{A}}$, we make the following assumptions.

Assumption 1. (i) For any integer $l > 0$ and any subset

When $X_{ij} = X_i - X_j$, where the $X_i$ are independent and identically distributed random variables following a sub-Gaussian distribution, part (i) of Assumption 1 holds. Assumption 2 requires that all the eigenvalues of $G_0$ and $G_{0A}$ be distinct, with positive gaps between them. We have the following convergence results.
Theorem 1. Assume that $\max_{i \neq j} |s_{ij}| < c_0$ almost surely and that Assumptions 1 and 2 hold. Then $\|\hat{G} - G_0\|_{\mathrm{op}} = O_p\{\delta^{\mathrm{op}}_n + (p m_{\mathrm{net}}^2 / n)^{1/2}\}$. If we assume further that $\|G_0\|_{\mathrm{op}} < C_0$ for some constant $C_0$ independent of $p$, then corresponding bounds hold for the eigenvalues $\hat{\lambda}_i$ and eigenvectors $\hat{v}_i$ for $i = 1, \ldots, p$.

Next, we provide an approximation to $m_{\mathrm{net}}$ when an additional condition on the largest degree, as in Assumption 3 below, is imposed. Specifically, we only require the conditional independence property to hold for all but one of the groups. For $\tilde{m}_{\mathrm{net}}$ defined in Theorem 2 below, we show in the Supplementary Material that for any permutation $\{\sigma(1), \ldots, \sigma(n)\}$, one can always split the index pairs into $\tilde{m}_{\mathrm{net}}$ groups such that the conditional independence property holds for the first $\tilde{m}_{\mathrm{net}} - 1$ groups. In other words, for any $\bar{\sigma}(i)$ and $\bar{\sigma}(j)$ in $G_s$ with $s = 1, \ldots, \tilde{m}_{\mathrm{net}} - 1$, we have $\{\bar{\sigma}(i), \bar{\sigma}(j)\} \in \mathcal{V}$. By combining the conditional independence property for the first $\tilde{m}_{\mathrm{net}} - 1$ groups with Assumption 3, we show that the conclusions of Theorem 1 still hold, but with $m_{\mathrm{net}}$ replaced by $\tilde{m}_{\mathrm{net}}$.
It is easy to see that for the generalized graphon model, $d_{\max} \leq \max_i |N_i| \times \max_i |\{j : \xi_i \text{ is associated with node } j\}|$. Assumption 3 enables us to control the $\tilde{m}_{\mathrm{net}}$th group, where the conditional independence property may fail to hold and an upper bound on the number of correlated nodes is then necessary.
Theorems 1 and 2 establish the asymptotic properties of $\hat{G}$. The term $\delta^{\mathrm{op}}_n$ in these theorems can be seen as the approximation error, and it is zero when the $Z_{ij}$ have the same distribution. The term $(p m_{\mathrm{net}}^2 / n)^{1/2}$, or $(p \tilde{m}_{\mathrm{net}}^2 / n)^{1/2}$, can be seen as the estimation error, in which $m_{\mathrm{net}}$, or $\tilde{m}_{\mathrm{net}}$, can be loosely understood as the effect of the network. If $d_{\max}$ is bounded by a constant, then $\tilde{m}_{\mathrm{net}} = O(\log n)$ and the convergence rate of $\hat{G}$ is $O_p\{(p/n)^{1/2} \log n\}$. If $d_{\max} = O(\log n)$, then upon noting that $1/\log\{4 d_{\max} / (4 d_{\max} - 1)\} = 1/\log[1 + \{1/(4 d_{\max} - 1)\}] \approx 4 d_{\max} - 1 = O(\log n)$, we have $\tilde{m}_{\mathrm{net}} = O(\log^2 n)$ and the convergence rate of $\hat{G}$ becomes $O_p\{(p/n)^{1/2} (\log n)^2\}$. Following the proof of this theorem, it can be seen that if the $s_{ij}$ are independent, then $\|\hat{G} - G_0\|_{\mathrm{op}} = O_p\{\delta^{\mathrm{op}}_n + (p/n)^{1/2}\}$. Theorem 1 indicates that if $d_{\max}$ is small, such as $d_{\max} = O(\log n)$, the convergence rate of $\hat{G}$ is similar to that in the independent case, up to a power of $\log n$.
In Theorems 1 and 2, the convergence rate of the estimator is established under the generalized graphon model by exploiting its latent variable representation. In fact, as the proof of Theorem 1 shows, the conclusions of the theorem still hold without the generalized graphon model assumption, as long as the following conditional independence property holds: the node pairs $\{\{\sigma(2i-1), \sigma(2i)\}, i = 1, \ldots, n/2\}$ can be split into groups such that the $s_{ij}$ with $(i, j)$ in the same group are conditionally independent given $\{X_{ij}\}$. Here the $s_{ij}$ with $(i, j)$ in different groups can still be correlated.
Using the relationship between $\hat{G}_A$ and $\hat{G}$, one can derive the asymptotic properties of $\hat{G}_A$ and its eigenvectors. Consequently, the convergence of $\hat{B}_{r,A}$ can be established by noting that $\hat{B}_{r,A}$ is a function of $A$ and the eigenvectors of $\hat{G}_A$. The same argument applies to $\hat{B}_{r,\hat{A}}$ when $A$ is unknown and estimated as $\hat{A}$. We make the following assumptions on the estimator of $A$.
Assumption 4 is standard, and $\tau_n$ in Assumption 5 is a function of $n$ and $p$, with $p$ omitted for simplicity. The following theorem gives the convergence rate of the estimator when $A$ is known or estimated as $\hat{A}$. For simplicity, we assume that $r$ is known.
Theorem 3. Assume that $\max_{i \neq j} |s_{ij}| < c_0$ almost surely and that Assumptions 1, 2 and 4 hold. Then the following conclusions hold.

(ii) When $A$ is unknown, assume further that Assumption 5 holds. Then, for any given $r = 1, \ldots, p$, the corresponding bound holds with an additional term involving $\tau_n$.

We give a concrete example to show the values of $\delta^{\mathrm{op}}_n$ and $\tau_n$.
Corollary 2. Suppose that the $s_{ij}$ are identically distributed, but dependent, and that $X_{ij} = X_i - X_j$ with the $X_i$ independent and identically distributed from $N(\mu, \Sigma)$, where the eigenvalues of $\Sigma$ are bounded away from 0 and $\infty$ uniformly over $p$. Assume that $\hat{A}$ is taken to be the sample covariance matrix. Then $\tau_n = (p/n)^{1/2}$ and $\delta^{\mathrm{op}}_n = 0$.
Theorem 3 shows that the convergence rate is determined by the approximation error $\delta^{\mathrm{op}}_n$, the dimension $p$ of the covariates, the network effect $m_{\mathrm{net}}$ and, if $A$ is unknown, the convergence rate $\tau_n$ of $\hat{A}$. As in Theorem 2, we can replace the unknown $m_{\mathrm{net}}$ with $\tilde{m}_{\mathrm{net}}$, as shown in the following theorem, the proof of which is the same as that of Theorem 3 and hence omitted. When $w_{ij}$, and hence $s_{ij}$, depend on $n$, upon replacing the condition $\max_{i \neq j} |s_{ij}| < c_0$ above by $\max_n \max_{i \neq j} |s_{ij,n}| < c_0$, we can see that the above conclusions still hold. Thus, Theorems 1-4 continue to hold for sparse networks.

Finally, we briefly discuss the selection of $r$, motivated by a similar procedure in Lam & Yao (2012), among others. Recall that $\hat{\phi}^A_1 \geq \cdots \geq \hat{\phi}^A_p$ are the eigenvalues of $\hat{G}_A$. We select $r$ as $\hat{r} = \arg\min_{1 \leq i \leq M} \hat{\phi}^A_{i+1} / \hat{\phi}^A_i$, where $M$ is a fixed number. When $A$ is unknown and estimated as $\hat{A}$, we use the eigenvalues of $\hat{G}_{\hat{A}}$ instead.
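Under the eigenvalue-ratio reading of this criterion, the selection amounts to a one-line computation; the sketch below assumes $\hat{G}_A$ has already been formed and that its leading eigenvalues are positive.

```python
import numpy as np

def select_r(G_A, M):
    """Eigenvalue-ratio sketch for choosing r: minimize the ratio of
    consecutive eigenvalues of G_A over i = 1, ..., M (in the spirit
    of Lam & Yao, 2012)."""
    phi = np.sort(np.linalg.eigvalsh(G_A))[::-1]   # descending eigenvalues
    ratios = phi[1:M + 1] / phi[:M]                # phi_{i+1} / phi_i
    return int(np.argmin(ratios)) + 1              # estimated r (1-based)
```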

4. Simulations
For the model in Proposition 2, it is assumed that $E(s_{ij} \mid X_{ij}) = h(B_0^{\mathrm{T}} X_{ij})$. Proposition 2 shows that our method can be used to recover $\mathrm{span}(B_0)$ in this setting. To verify the effectiveness of the proposed network-supervised dimension reduction method in recovering $\mathrm{span}(B_0)$, we conduct extensive simulations. The performance of our method is examined by computing the error measure $\|P_{B_0} - P_{\hat{B}_{r,\hat{A}}}\|_{\mathrm{F}}$, where $B_0$ is the true parameter to be estimated, $\hat{B}_{r,\hat{A}}$ is the estimator of $B_0$ using the method developed in this paper, and $P_B$ for any matrix $B$ is the projection matrix onto the space spanned by the columns of $B$. Here we take $A = \mathrm{cov}(X_1)$ and use the sample covariance matrix as its estimator. In all simulations, we take $s_{ij} = 1 - w_{ij}$ and assume that $r$ is known. We set $n = 100$ or 500 and take the dimension to be $p = 10$ or 50. For each example, 100 datasets are generated. Additional simulations for the selection of $r$ using the method in § 3 are presented in the Supplementary Material.

Example 1. We generate data according to the following procedure, inspired by a similar set-up in Weng & Feng (2016).
(i) Let $C_i \in \{1, 2\}$ be the latent community label. Generate $C_i$ from a Bernoulli distribution such that $\mathrm{pr}(C_i = 1) = \mathrm{pr}(C_i = 2) = 0.5$.
(ii) Generate $w_{ij}$ according to model (5), where $\mathrm{pr}(w_{ij} \mid C_i, C_j)$ is defined by $\mathrm{pr}(w_{ij} = 1 \mid C_i = C_j) = a$ and $\mathrm{pr}(w_{ij} = 1 \mid C_i \neq C_j) = b$.
In model (5), the first part can be seen as the community effect, and the second part is a logistic model representing the nodal effect. In the simulation, we set $r = 1$ and $B_0 = (1, 1, 0, \ldots, 0)^{\mathrm{T}} \in \mathbb{R}^p$, $a = 0.8$, $b = c_{\mathrm{com}} a$ with $c_{\mathrm{com}} = 1$, 0.5 or 0.1, and $c_{\mathrm{coef}} = [0.5 : 0.5 : 5]$, the grid points in the interval $[0.5, 5]$ with step size 0.5. Obviously, $c_{\mathrm{com}} = 1$ corresponds to no community effect, while $c_{\mathrm{com}} = 0.1$ corresponds to a strong community effect. A larger $c_{\mathrm{coef}}$ implies a larger nodal effect, and when $c_{\mathrm{coef}} = 0$ there is no nodal effect. The generated networks have a wide range of densities, from 3.8% when $c_{\mathrm{com}} = 0.1$ and $c_{\mathrm{coef}} = 0.1$ to 41.5% when $c_{\mathrm{com}} = 1.0$ and $c_{\mathrm{coef}} = 0.5$. The simulation results are plotted in Fig. 2.

Example 2. Consider a setting where each node $i$ is affected by its $K$ neighbours, whose index set is denoted by $\tilde{N}_i$. The data are generated as follows. (i) First, generate $\tilde{N}_i$ for node $i$. Let $\mu_1, \ldots, \mu_n$ be independent and identically distributed random variables from $\mathrm{Un}(0, 1)$, and let $d_{ij} = |\mu_i - \mu_j|$ $(i, j = 1, \ldots, n)$. For each node $i$, compute its $K$ nearest neighbours according to the distance $d_{ij}$, and define $\tilde{N}_i$ as the set containing those indices $j \neq i$ such that node $j$ is one of node $i$'s $K$ nearest neighbours. (ii) Then generate $w_{ij}$, where $c_{\mathrm{coef}}$ is as specified in Example 1, $g(\{Y_{ij}\}) = |Y_{ij}| \wedge \sum_{k \in \tilde{N}_i, k' \in \tilde{N}_j} |Y_{kk'}| / K^2$ and $a \wedge b = \min\{a, b\}$.
In this model, we have $N_i = \{i\} \cup \tilde{N}_i$ with $i \notin \tilde{N}_i$, where $\tilde{N}_i$ may be seen as the latent neighbour set of node $i$. When $\tilde{N}_i = \emptyset$, node $i$ is affected only by its latent variable $Y_i$, where the $Y_i$ are independent of each other. When $\tilde{N}_i \neq \emptyset$, we actually have an underlying network introduced by the latent neighbour sets $\tilde{N}_i$. This network can be seen as the underlying truth, whereas the network generated by the $w_{ij}$ is an observed one. The added dependence can lead to a flexible model with better interpretation. For example, in the study of genetic data, one might view the underlying network as the true network between genes, and view the observed network represented by the $w_{ij}$ as a noisy one contaminated by measurement errors and affected by environmental factors.
We set $B_0 = (\beta_1, \beta_2) \in \mathbb{R}^{p \times 2}$, where $\beta_1 = (1, 1, 0, \ldots, 0)^{\mathrm{T}} \in \mathbb{R}^p$ and $\beta_2 = (1, -1, 0, \ldots, 0)^{\mathrm{T}} \in \mathbb{R}^p$, and we consider $K = 2$ and $K = 4$. For this model, the probability of $w_{ij} = 1$ depends on the latent variables $\{Y_k : k \in N_i \cup N_j\}$ and the covariates $X_i$ and $X_j$. Clearly, this model is a generalized graphon model as defined in § 3. The results of this simulation are shown in Fig. 3.

We briefly discuss these results. It is easy to see that the influence of the covariate $X_{ij}$ decreases in both examples when $c_{\mathrm{coef}}$ decreases. In particular, $X_{ij}$ has no effect when $c_{\mathrm{coef}} = 0$. We can see from Figs 2 and 3 that the average errors decrease as $c_{\mathrm{coef}}$ increases. This is reasonable, because the covariates contribute more and more information with increasing $c_{\mathrm{coef}}$. Overall, the errors decrease as $n$ increases in both examples, which is expected from the theoretical results on the convergence rate. Interestingly, it can be seen from Fig. 2 that the errors are similar for different $c_{\mathrm{com}}$ in Example 1. This is due to the fact that the community label $C_i$ is independent of the covariate $X_{ij}$ in the data-generating process.

Finally, to illustrate another application of the proposed dimension reduction method, we briefly outline how to select important covariates. Since our method is similar to principal component analysis, which aims to find the eigenvectors of a matrix, we have developed a procedure similar to sparse principal component analysis for obtaining a sparse estimator of the projections, as in Zou et al. (2006). More specifically, as in (4), our estimator is $\hat{B}_{r,A} = (\hat{\beta}_{A,1}, \ldots, \hat{\beta}_{A,r})$ with $\hat{\beta}_{A,j} = A^{-1/2} \hat{\varphi}^A_j$, where $\hat{\varphi}^A_j$ is an eigenvector of $\hat{G}_A = A^{-1/2} \hat{G} A^{-1/2}$. To implement our procedure, we use the truncated power algorithm of Yuan & Zhang (2013), which we illustrate in the case where a sparse estimate of $\beta_{A,1}$ is sought. Given an initial value $v_0 \in \mathbb{R}^p$, for $t = 1, 2, \ldots$ let $v_t = \hat{G}_A v_{t-1} / \|\hat{G}_A v_{t-1}\|$, truncate $A^{-1/2} v_t$ by keeping only the largest $m_0$ entries in absolute value, denote the resulting vector by $\vartheta_t$, and set $v_t = A^{1/2} \vartheta_t$. Repeat the procedure until $\vartheta_t$ converges; the final $\vartheta_t$ is the sparse estimate of $\beta_{A,1}$. In Table 1 we report some preliminary results on variable selection using Example 1 with $p = 10$ and $n = 100$ for illustration, where the experiments are run 100 times under each setting. Since the first two variables are significant in this example, we set $m_0 = 2$ in the algorithm. It can be seen that the results are all satisfactory, especially when $c_{\mathrm{coef}} > 0.5$.
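The truncated power iteration just described is straightforward to code; the sketch below follows the steps in the text, with the initial value, tolerance and iteration cap being illustrative choices.

```python
import numpy as np

def sparse_direction(G_A, A_inv_half, A_half, m0, tol=1e-8, max_iter=500):
    """Truncated power sketch (after Yuan & Zhang, 2013) for a sparse
    estimate of beta_{A,1}: power step on G_A, truncate A^{-1/2} v_t to
    its m0 largest entries in absolute value, map back with A^{1/2}."""
    p = G_A.shape[0]
    v = np.ones(p) / np.sqrt(p)              # initial value v_0
    theta_old = np.zeros(p)
    for _ in range(max_iter):
        v = G_A @ v
        v /= np.linalg.norm(v)               # v_t = G_A v_{t-1} / ||G_A v_{t-1}||
        theta = A_inv_half @ v
        keep = np.argsort(np.abs(theta))[-m0:]
        trunc = np.zeros(p)
        trunc[keep] = theta[keep]            # keep the m0 largest entries
        if np.linalg.norm(trunc - theta_old) < tol:
            return trunc                     # converged sparse estimate
        theta_old = trunc
        v = A_half @ trunc                   # v_t = A^{1/2} theta_t
    return theta_old
```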

5. Real data analysis
We apply the proposed method to a pulsar candidates dataset collected by the High Time Resolution Universe survey (Keith et al., 2010), available at http://archive.ics.uci.edu/ml/datasets/HTRU2. Pulsars are a rare type of neutron star that produce radio emissions detectable on Earth. They are of considerable scientific interest as probes of space-time, the interstellar medium and states of matter. Their study can lead to a better understanding of many physics problems, ranging from the acceleration of particles in ultrastrong magnetic fields to tests of gravity in the strong-field regime. Since some pulsars are binary systems (Lyne & Smith, 2012), the signals detected for a star may be mixed signals from that star and its neighbours, together with some noise, implying that our generalized graphon model is applicable to such data. In the dataset we analyse, each pulsar candidate is described by eight continuous variables and a single class variable; the data include 16 259 spurious examples caused by radio frequency interference or noise and 1639 real pulsar examples that have been checked by human annotators. The continuous variables are the mean of the integrated profile, the standard deviation of the integrated profile, the excess kurtosis of the integrated profile, the skewness of the integrated profile, the mean of the dispersion measure versus signal-to-noise ratio curve, the standard deviation of the ratio curve, the excess kurtosis of the ratio curve, and the skewness of the ratio curve. That is, the first four variables are simple statistics obtained from the integrated pulse profile, while the remaining four are similarly obtained from the ratio curve. In addition, we observe that the sample covariance matrices of these two groups are different. We randomly select 200 observations from the 16 259 spurious examples and 100 observations from the 1639 real pulsar examples to construct a graph. For these 300 nodes, we say that two nodes are connected if their difference in the first variable, the mean of the integrated profile, is small. We choose a threshold such that the network density, defined as the ratio of the number of edges to the maximum possible number of edges, is 50%, 30%, 10%, 5%, 3% or 1%. The remaining eight variables are used as nodal covariates. In defining the graph, we do not use the information on the labels of these observations. This data-generating process is repeated 100 times.
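The graph construction just described can be sketched as follows; choosing the threshold as the appropriate quantile of the pairwise differences attains the target density, with the function name build_network being hypothetical.

```python
import numpy as np

def build_network(x1, density):
    """Connect two candidates when the absolute difference in their
    first variable (mean of the integrated profile) falls below a
    threshold chosen so that the graph attains the target density."""
    x1 = np.asarray(x1)
    n = len(x1)
    D = np.abs(x1[:, None] - x1[None, :])
    iu = np.triu_indices(n, 1)
    thresh = np.quantile(D[iu], density)   # e.g. density = 0.05 for 5%
    W = (D <= thresh).astype(int)
    np.fill_diagonal(W, 0)
    return W
```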
For our approach, we take $A = \mathrm{cov}(X)$ and estimate it by the sample covariance matrix. The rank $r$ is chosen by the method outlined at the end of § 3, with $M = 4$. Our approach is compared with the methods of Zhang et al. (2016), Binkiewicz et al. (2017), Huang & Feng (2020) and Yan & Sarkar (2020), as well as the spectral clustering method of Rohe et al. (2011). In addition, we include a version of our approach that estimates $A = \Sigma$ via the algorithm in the Supplementary Material with $r = 1$. Since the true community membership of each node is known, we report the average of the proportions of nodes that are misclassified; the smaller this quantity, the better the method. The results averaged over 100 random datasets are plotted in Fig. 4. It is observed that our network-supervised dimension reduction approach with the two choices of $A$ performs better than the other methods in most cases, especially when the network is sparse. Moreover, our approach is insensitive to the sparsity of the network, while the methods of Binkiewicz et al. (2017), Rohe et al. (2011) and Zhang et al. (2016) work well only when the network is dense.
Acknowledgement

of China and China's National Key Research Special Program. Leng was supported by a Turing Fellowship.

Supplementary material
Supplementary Material available at Biometrika online includes proofs of the theoretical properties and additional theoretical and simulation results. The code for community detection is available at https://github.com/DR-Network/community-detection.