Community detection and reciprocity in networks by jointly modeling pairs of edges

To unravel the driving patterns of networks, the most popular models rely on community detection algorithms. However, these approaches are generally unable to reproduce the structural features of the network. Therefore, models are continually being developed that incorporate these network properties besides the community structure. In this work, we present a probabilistic generative model and an efficient algorithm to both perform community detection and capture reciprocity in networks. Our approach jointly models pairs of edges with exact 2-edge joint distributions. In addition, it provides closed-form analytical expressions for both marginal and conditional distributions. We validate our model on synthetic data in recovering communities, edge prediction tasks, and generating synthetic networks that replicate the reciprocity values observed in real networks. We also highlight these findings on two real datasets that are relevant for social scientists and behavioral ecologists. Our method overcomes the limitations of both standard algorithms and recent models that incorporate reciprocity through a pseudo-likelihood approximation. The inference of the model parameters is implemented through an efficient and scalable expectation-maximization algorithm that exploits the sparsity of the dataset. We provide an open-source implementation of the code online.

of reciprocity more similar to those of real data, and better edge predictions. However, this model relies on a pseudo-likelihood approximation for parameter inference, as the model only specifies conditional distributions, but not the joint distribution of a pair of edges. As a result of this approximation, the model is not robust in community detection in the regime where reciprocity plays a role. Peixoto [24] has shown similar results in terms of triadic closure with a model based on Bayesian inference that combines community structure with this network property. This model also assumes conditional independence among edges and models conditional distributions of triadic edges.
Here we propose a model that takes into account community structure and reciprocity by specifying a closed-form joint distribution of a pair of network edges, which does not involve approximations. To estimate the likelihood of network ties, we use a bivariate Bernoulli distribution, a special case of the multivariate Bernoulli distribution, where the log-odds are linked to community memberships and pair-interaction variables. Although these patterns are indicative of two distinct mechanisms of network formation, namely community structure and reciprocity, it is reasonable to expect that they are related to each other. For instance, i) the preferred connection between nodes of the same community can induce the presence of reciprocated edges involving similar nodes, and ii) the tendency of forming mutual connections can induce the formation of groups of nodes. This conflation means that we cannot reliably interpret the underlying mechanisms of network formation merely from the abundance of reciprocated edges or observed community structure in network data. Our model takes advantage of the useful properties of the bivariate Bernoulli distribution, i.e., the independence and the uncorrelatedness of the component random variables are equivalent, and both the marginal and conditional distributions still follow the Bernoulli distribution. Hence, our model has closed-form analytical expressions and enables practitioners to address with more accuracy questions that were not fully captured by standard models; for instance, predicting the joint existence of mutual ties between pairs of nodes. In addition, its algorithmic implementation is efficient and scalable to large system sizes, as it exploits the sparsity of network datasets, thus allowing broad applications across disciplines, e.g., citation networks or neuronal networks that consist of several thousand nodes.

II. The model
The main goal of this work is to develop a probabilistic generative model with latent variables that better captures real scenarios where non-trivial interaction patterns are observed in networks. This is achieved by jointly modeling the edges between the same pair of nodes, unlike standard models that assume their conditional independence given the latent variables. Formally, we model the interactions of N individuals as a binary asymmetric matrix A, with entries A ij defining the presence or the absence of a connection from node i to node j.
Thus, P (A ij , A ji |Θ) can be viewed as a member of the exponential family, and can be represented in a log-linear formulation as in Equation (1), where f ij , f ji , and J (ij) represent the natural parameters. J (ij) is called the cross-product ratio between A ij and A ji and represents the log-odds of the model. Similar to the Ising model [16], if J (ij) = 0 then the components of the bivariate Bernoulli random vector (A ij , A ji ) are independent, thanks to the equivalence of independence and uncorrelatedness for multivariate Bernoulli distributions [5]. In this case, the resulting model would be equivalent to considering the product of two independent Bernoulli distributions. Another interesting property of the bivariate Bernoulli is that both marginal and conditional distributions are univariate Bernoulli. Thus, our model has closed-form equations for the joint, conditional and marginal distributions. We now assume that a set of latent variables captures hidden patterns of the data. There are many possibilities to add these variables: one could act directly on the marginal or conditional first moments, as well as modelling separately the different p αβ , with α, β ∈ {0, 1}. However, we model the log-ratios to ease interpretability and the analytical computations. Specifically, we assume f ij = log λ ij , where λ ij = Σ kq u ik w kq v jq captures mixed-membership community structure as in De Bacco et al. [7], and J (ij) = log η, where η is the pair-interaction coefficient.
The parameters u ik , v jq are entries of the K-dimensional vectors u i and v j , the out-going and in-coming community memberships respectively; and w kq are the entries of a K × K affinity matrix, which regulates the structure of communities, e.g., assortative when its diagonal entries are greater than the off-diagonal entries (homophily). Thus, Θ = (u, v, w, η) are the latent parameters we want to infer. Through Equations (3)-(5) we encode the assumptions that community structure drives the process of edge formation, and that the edges of a pair of nodes depend on each other explicitly according to the parameter η. When J (ij) = 0, the probability of A (ij) is given by the agreement of the communities of i and j only, while a positive value of the log-odds boosts the chance of observing a tie between them. Conversely, J (ij) < 0 decreases the value of p 11 , the probability that both edges exist. Considering Equation (5): 0 < η < 1 and η > 1 codify a negative and positive interaction between i and j, respectively. The former lowers the probability of observing both ties i → j and j → i, while the latter increases it. Finally, η = 1 implies no interaction between A ij and A ji .
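To make this mapping concrete, the four probabilities p αβ can be evaluated directly from the latent parameters, since p(a, b) is proportional to λ ij^a λ ji^b η^(ab) with λ ij = Σ kq u ik w kq v jq . The following sketch is illustrative (function and variable names are ours, not the authors' released implementation):

```python
import numpy as np

def joint_edge_distribution(u_i, v_i, u_j, v_j, w, eta):
    """Joint distribution of the pair (A_ij, A_ji): p(a, b) is
    proportional to lam_ij**a * lam_ji**b * eta**(a * b)."""
    lam_ij = u_i @ w @ v_j                 # community-driven term for i -> j
    lam_ji = u_j @ w @ v_i                 # community-driven term for j -> i
    Z = 1.0 + lam_ij + lam_ji + eta * lam_ij * lam_ji   # normalization
    p = np.array([[1.0,    lam_ji],
                  [lam_ij, eta * lam_ij * lam_ji]]) / Z
    return p                               # p[a, b] = P(A_ij = a, A_ji = b)
```

With η = 1 the joint factorizes into the product of the two marginals, recovering the independence property discussed above.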
With this model at hand we can estimate observable quantities that are valuable for practitioners. For instance, one can ask about the expected value of a given tie, in general or conditioned on the existence of the opposite one, quantities defined as E [A ij ] = p 10 + p 11 and E [A ij |A ji = b] = p 1b / (p 0b + p 1b ), and similarly for E [A ji ] and E [A ji |A ij ]; see Appendix A. With these quantities one can perform edge prediction tasks, which is crucial when we are limited to a subset of the dataset.
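As a sketch of how these scores follow from the 2×2 joint table p αβ introduced above (illustrative code under our own naming, not the released implementation):

```python
import numpy as np

def marginal_and_conditional(p):
    """Given the 2x2 joint table p[a, b] = P(A_ij = a, A_ji = b),
    return the marginal E[A_ij] and the conditionals E[A_ij | A_ji = b]."""
    marginal = p[1, :].sum()                # P(A_ij = 1) = p10 + p11
    conditional = p[1, :] / p.sum(axis=0)   # entry b: P(A_ij = 1 | A_ji = b)
    return marginal, conditional
```

Either quantity can then serve as a score in edge prediction; the conditional one additionally exploits the observed state of the opposite edge.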

III. Inference
We infer the parameters using a maximum likelihood approach. Specifically, we maximize the log-likelihood with respect to Θ = (u, v, w, η). Adopting a variational approach, this is equivalent to maximizing a lower bound L(ρ, Θ), where we introduced the variational distribution ρ ijkq over the parameters and used Jensen's inequality. The equivalence holds when ρ is chosen as in Equation (11). We estimate the parameters by using an expectation-maximization (EM) algorithm where at each step one updates ρ using Equation (11) (E-step) and then maximizes L(ρ, Θ) with respect to Θ = (u, v, w, η) by setting partial derivatives to zero (M-step). This iteration is repeated until the log-likelihood converges. The exact equations for the parameter updates are in Appendix A, and the whole routine is described in Algorithm 1. This algorithm is computationally efficient and scalable to large system sizes as it exploits the sparsity of the dataset. Indeed, every update involves a numerator that sums over the entries A ij , hence only the non-zero entries contribute, giving an algorithmic complexity of O(M K 2 ), where M = Σ i,j A ij is the number of ties.
Algorithm 1 (JointCRep):
Initialize u, v, w, η at random.
Repeat until L converges:
1. Calculate ρ (E-step).
2. Update Θ (M-step):
   i) for each pair (i, k), update the memberships;
   ii) for each pair (k, q), update the affinity matrix;
   iii) update the pair-interaction parameter.
Our model (JointCRep) aims to generalize the method presented in Safdari et al. [27] (CRep), whose latent variables inspired the generative process proposed here. We refer to [27] for a detailed explanation of that method and summarize its main properties in Table I.
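Schematically, Algorithm 1 amounts to the following EM driver; the e_step, m_step, and log_likelihood callables stand in for the closed-form updates of Appendix A, so this is an illustrative skeleton rather than the released implementation:

```python
import numpy as np

def em_driver(A, K, e_step, m_step, log_likelihood,
              max_iter=500, tol=1e-8, rng=None):
    """Alternate E-step (update rho) and M-step (update u, v, w, eta)
    until the log-likelihood L converges."""
    rng = rng or np.random.default_rng(0)
    N = A.shape[0]
    # random initialization of the latent parameters Theta
    theta = dict(u=rng.random((N, K)), v=rng.random((N, K)),
                 w=rng.random((K, K)), eta=rng.random() + 0.5)
    prev = -np.inf
    for _ in range(max_iter):
        rho = e_step(A, theta)          # E-step: variational distribution
        theta = m_step(A, rho, theta)   # M-step: maximize L(rho, Theta)
        curr = log_likelihood(A, theta)
        if abs(curr - prev) < tol:      # stop when L converges
            break
        prev = curr
    return theta
```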

IV. Results
In this section, we present the results obtained on synthetic and real networks. For comparison we use CRep, the model that combines communities and reciprocity with a pseudo-likelihood approximation [27], and MultiTensor (MT), a community detection-only generative model with a maximum likelihood approach [7]. Although both of them posit a Poisson likelihood, in this work we use only binary networks for fair comparisons with our model JointCRep. We summarize the main similarities and differences among the models used in the analysis in Table I.
(Table I caption: comparison of the JointCRep, CRep, and MT models. λ represents the community effect and η is the parameter linked to the reciprocity r. MT is a community detection-only model, therefore it does not have a reciprocity parameter. In addition, it uses the conditional independence assumption according to which the conditional and the marginal distributions coincide. For this reason, the closed-form conditional and joint distributions do not apply for this method.)

We first validate the performance of the different methods on synthetic data generated with the model proposed in this work. Being a generative model, given an initial set of parameters as input, one can draw a directed network with community structure and a given reciprocity value from the expression in Equation (1). The generative process is described in detail in Appendix B. We analyse networks with N = 1000 nodes, K = 2 overlapping communities, average degree k = 20, and different values of the pair-interaction parameter η such that we obtain networks with reciprocity values r in the interval [0, 0.8]. We generate 10 random samples for each value of r. In addition to these results, we provide further details for synthetic networks generated with different values of the average degree in Appendix C 3. We test the ability of the models to i) recover the communities, ii) perform edge prediction tasks, and iii) generate sample networks that replicate relevant network quantities.
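A minimal sketch of this generative process (variable names are ours): each unordered pair of entries (A ij , A ji ) is drawn jointly from the four-outcome distribution of Equation (1); for simplicity, the sparsity rescaling of Appendix B is omitted:

```python
import numpy as np

def sample_network(u, v, w, eta, rng=None):
    """Draw a directed binary network by sampling each pair
    (A_ij, A_ji) jointly from the bivariate Bernoulli model."""
    rng = rng or np.random.default_rng()
    N = u.shape[0]
    A = np.zeros((N, N), dtype=int)
    for i in range(N):
        for j in range(i + 1, N):
            lij = u[i] @ w @ v[j]          # rate for i -> j
            lji = u[j] @ w @ v[i]          # rate for j -> i
            Z = 1.0 + lij + lji + eta * lij * lji
            # probabilities of (0,0), (1,0), (0,1), (1,1)
            probs = np.array([1.0, lij, lji, eta * lij * lji]) / Z
            a, b = [(0, 0), (1, 0), (0, 1), (1, 1)][rng.choice(4, p=probs)]
            A[i, j], A[j, i] = a, b
    return A
```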

Community detection
To evaluate the performance of the methods in recovering the communities, we use the cosine similarity (CS), a measure suited to capture mixed-membership communities, as in this case. It ranges from 0 to 1, where 1 means perfect recovery. We calculate the cosine similarities of both membership matrices u and v, and then average over the nodes. The results are shown in Figure 1. In the scenarios with low reciprocity values (r < 0.4), all models perform well. However, as r increases, CRep worsens while JointCRep maintains good results, comparable to those of the community-only algorithm MT. The sharp drop of CRep is due to the fact that this model gives increasingly less weight to communities as reciprocity increases, as pointed out in Safdari et al. [27]. Conversely, JointCRep is not affected by the different reciprocity values of the data and still performs as well as MT, despite adding another parameter to the model.
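For reference, a minimal implementation of the row-wise cosine similarity between true and inferred membership matrices; note that a full evaluation should also maximize over permutations of the community labels, which this sketch omits:

```python
import numpy as np

def cosine_similarity_score(U_true, U_inf):
    """Average cosine similarity between corresponding rows of the true
    and inferred membership matrices; 1 means perfect recovery (up to
    the given column ordering)."""
    num = (U_true * U_inf).sum(axis=1)
    den = np.linalg.norm(U_true, axis=1) * np.linalg.norm(U_inf, axis=1)
    return float(np.mean(num / den))
```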

Edge prediction
The edge prediction task consists in estimating the existence of an edge by using the inferred parameters. The main quantity used as a score for the estimation of the entries of the adjacency matrix A is the expected value of the marginal distribution. However, our model also provides the conditional distribution; hence its expected value can also be used as a score. The difference lies in the nature of the question we try to answer. We use the marginal distribution to merely predict the existence of an edge, without considering additional information. On the other hand, with the conditional distribution, we ask what is the probability of an edge i → j, conditioned on observing the state of the opposite edge j → i, i.e., knowing if it exists or not. Here, we exploit the presence or the absence of the edge in the opposite direction to better predict each given entry. Furthermore, our model specifies a joint distribution over the edges of a pair of nodes, and this allows us to answer questions more accurately compared to standard approaches. In our experiments below, we test edge prediction with various scores by using 5-fold cross-validation. Specifically, we divide the dataset into five equal-size groups (folds) and train the models on four of them (training data) for learning the parameters; this contains 80% of the possible pairs of nodes in the network, so that we hide pairs of entries (A ij , A ji ) from the training. One then predicts the existence of edges in the held-out group (test set). As performance metric, we measure the AUC on the test data, i.e., the probability that a randomly selected edge has a higher expected value than a randomly selected non-existing edge. We compute both the regular and the conditional AUC values. To estimate the regular AUC, we take the expected value E P (A ij |Θ) [A ij ] as the score, while for the conditional AUC, the expected value over the conditional distribution, i.e., E P (A ij |A ji ,Θ) [A ij ], acts as the score.
The latter cannot be computed for the community detection-only algorithm, as the marginal distribution is the same as the conditional, and thus the two AUC values coincide. We provide more details in Appendix C 1, where we also show the ability of our model in edge prediction tasks by using the joint distribution. Figure 2 displays the results of the marginal and conditional edge prediction for the different models. JointCRep significantly improves on the performance of CRep when using the marginal expected value, and it performs as well as MT. The latter, however, is not able to exploit the additional information given by the existence (or non-existence) of the edge in the opposite direction. This dependence is crucial in networks with reciprocity, i.e., most real-world datasets, and models with an explicit conditional distribution can better exploit this information to obtain higher performance in edge prediction. Indeed, JointCRep and CRep perform remarkably well in this task, and our model presents more robust results in terms of both standard deviation and growth.
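The AUC defined this way, i.e., the probability that a randomly chosen existing edge receives a higher score than a randomly chosen non-existing edge (with ties counting one half), can be computed exactly as:

```python
import numpy as np

def auc(scores, labels):
    """Exact AUC: fraction of (existing, non-existing) edge pairs where
    the existing edge gets the higher score; ties count 1/2."""
    pos = scores[labels == 1]          # scores of existing edges
    neg = scores[labels == 0]          # scores of non-existing edges
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())
```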

Reproducing the topological properties
A notable property of generative models is their ability to produce synthetic networks based on real-world datasets, such that the synthetic networks imitate the topological features of the real datasets. Following the approach in Safdari et al. [27], for each individual network we infer the network parameters by applying each model. Then, we use these inferred parameters to generate five network samples. We compare topological properties of these samples with those observed in the ground truth networks used to infer the parameters. In particular, we are interested in measuring reciprocity. Figure 3 shows the performance of each model in reproducing this feature in sampled networks. As expected, MT is not capable of reflecting the observed value of the reciprocity in the ground truth network, a clear indication of the shortcoming of models based purely on community structure, which indeed limits their applications. Conversely, JointCRep perfectly reproduces this quantity. CRep generates sampled networks with reciprocity lower than the ground truth, due to the fact that it uses a Poisson likelihood resulting in weighted networks. Additional results are provided in Appendix C 2.
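For reference, the reciprocity of a binary directed network, i.e., the fraction of edges that have a counterpart in the opposite direction, can be computed as:

```python
import numpy as np

def reciprocity(A):
    """Fraction of reciprocated edges: entries with A_ij = A_ji = 1
    relative to the total number of edges M."""
    A = np.asarray(A)
    M = A.sum()
    return float((A * A.T).sum() / M) if M else 0.0
```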
To summarize the results on synthetic networks, JointCRep is capable of recovering communities on networks with varying reciprocity values, performing as well as models based purely on community structure. This capability overcomes the limitations of the recent CRep model. Moreover, JointCRep achieves notable improvements in the edge prediction task, i.e., better results in terms of marginal AUC and more robust conditional AUC values. Furthermore, JointCRep is also capable of generating sampled networks with topological features that resemble those of the real data, e.g., reciprocity and average degree. Collectively, these findings show that JointCRep is able to overcome the limitations of both the community detection-only algorithm MT and the model that incorporates reciprocity through the pseudo-likelihood approximation, CRep.

B. Analysis of a high-school social network
We now study the social network that describes friendships among boys in a small high school in Illinois, collected in the fall of 1957 [3]. Here, a node represents a boy and an edge from an individual i to j indicates that node i claimed to be a friend of node j. We pre-processed the dataset by removing self-loops and isolated nodes. The resulting directed network has 31 nodes, 100 edges and reciprocity equal to 0.52, i.e., only half of the edges (friendship relationships) are reciprocated. There is no additional metadata to describe the nodes, nor is there an available ground truth for the latent parameters. Therefore, we estimate the number of communities K by performing edge prediction tasks via 5-fold cross-validation with different values of K. For each method, the best performance in terms of AUC was achieved with K = 4. Edge prediction also serves as a model validation routine in the absence of ground truth information, as is the case here. We found that results vary depending on the metric considered for evaluation, but in general they confirm that all models fit the data well, considering that the dataset is small and highly sparse, which makes prediction tasks hard. Further details on the edge prediction task are in Appendix D 2. Figure 4 visualizes the mixed-membership partitions resulting from the matrix u, inferred by the different methods (similar results are obtained for v). Here we use the inferred value of u obtained from the run with the highest log-likelihood over 100 random initializations of the parameters. All the algorithms assign most of the students to the same groups, except for a central block. Here, MT infers mostly hard memberships and balances the number of nodes in each cluster. Instead, CRep allocates only three nodes with small degree to the green community, while placing the nodes with higher degree in other clusters.
JointCRep shows a partition that lies in between, by inferring mixed memberships for those nodes known as bridges. To quantitatively measure the diversity of communities inferred by the various methods, we compute a modularity for directed networks and overlapping communities using different aggregation functions, as proposed by Nicosia et al. [22]. Modularity assumes all communities to have statistically similar properties, in particular similar sizes, and it is suited for assortative community structure [10, 20]. The communities shown in Figure 4 reflect these properties to a certain extent, and we report the results in terms of overlapping modularity in Table S2. We find modularity values between 0.48 and 0.75, depending on the aggregation function, with JointCRep and MT achieving the highest values.
Given the inferred parameters, we can test the ability of the models to reconstruct the input network, by using either the marginal expected value E P (A ij |Θ) [A ij ] or the conditional expected value E P (A ij |A ji ,Θ) [A ij ] as the score. Note that the latter is not available for MT because the conditional and marginal distributions coincide. Figure 5 presents the results, where edge width and darkness of the reconstructed networks are proportional to the weight given by the expected score (for visualization clarity, we show only edges with weight greater than 0.2). The network estimated by CRep, which uses the expected value of the marginals, does not capture the structure of the data well, as it overestimates the presence of edges. This model specifies conditional distributions and relies on a pseudo-likelihood approximation; since this approach is not necessarily accurate enough to approximate marginals, such results are expected. Instead, MT and JointCRep estimate a sparser representation that is closer to the observed network. However, MT is not able to detect reciprocated edges, e.g., (10, 18), whereas JointCRep recovers this type of interaction more precisely. For both JointCRep and CRep, using the conditional expected values improves their reconstruction accuracy, resulting in correctly identified reciprocal edges. The difference between the two models lies in the intensity: for instance, JointCRep predicts the pair of edges between nodes 10 and 18 with a high probability, while CRep assigns a much lower probability to them. Hence, JointCRep is not only able to predict edges more precisely, but it also does so with higher probability. These qualitative observations are also confirmed by quantitative comparisons in terms of the Log Loss and the L1 Loss, two penalty metrics computed between the reconstructed and the true networks.
They measure the difference between two input networks by taking into account the probability of the existence of an edge and computing a penalty for each mistake in predicting the observed value. A penalty of 0 denotes perfect reconstruction, i.e., assigning probability 1 to the observed edge values. In general, lower values indicate higher similarity. Further details can be found in Appendix D 1. We find that JointCRep achieves the lowest values (i.e., the best performance) in both metrics, with the best overall performance achieved using the conditional expectation. See Table S3 for details.
(Figure 5 caption: for visualization clarity we show only edges with weight greater than 0.2; node size is proportional to the degree (in- and out-degree) and node labels represent node IDs.)
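A plausible implementation of these two penalty metrics, assuming an element-wise comparison between the observed adjacency matrix A and a matrix P of predicted edge probabilities (the exact definitions used in the manuscript are in Appendix D 1):

```python
import numpy as np

def log_loss(A, P, eps=1e-12):
    """Cross-entropy penalty: approaches 0 when probability 1 is assigned
    to every observed entry; P is clipped to avoid log(0)."""
    P = np.clip(P, eps, 1.0 - eps)
    return float(-(A * np.log(P) + (1 - A) * np.log(1 - P)).mean())

def l1_loss(A, P):
    """Mean absolute difference between observations and probabilities."""
    return float(np.abs(A - P).mean())
```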
To further compare the strengths of these methods, we examine their performance in generating samples that resemble the observed network. For each model, we use the inferred parameters to generate five synthetic networks, as shown in Figure 6. Again, we notice that the samples generated by JointCRep better resemble the observed network, as it is easier to distinguish the four blocks generated by JointCRep compared to the samples from the other algorithms. In particular, JointCRep finds denser groups given by reciprocated edges. In addition to these qualitative results, Table S4 reports the topological properties of the observed data and the sampled networks, showing that JointCRep generates network samples that on average are most similar to the observed data in terms of average degree, reciprocity and clustering coefficient. For each method, we first infer the parameters by choosing the run with the highest log-likelihood over 100 random initializations. Then, we generate the samples by using the inferred parameters and the average degree of the original data as input to the generative models. The generative process of JointCRep is described in Appendix B; for CRep we use the formulation described in [27]; MT follows the formulation of a standard mixed-membership variant of a stochastic block model, as described in [7].

C. Analysis of a vampire bat network
As a second example, we study the network of food sharing interactions in captive vampire bats, collected by Carter and Wilkinson [2]. These animals often regurgitate food to roost-mates that fail to feed. The decision of whom to feed may depend on both kin relatedness and reciprocal sharing. Hence, in this dataset we expect reciprocity to be an important factor for tie formation. In the study, the authors fasted 20 vampire bats and induced food sharing on 48 days over a 2-year period. They showed that reciprocal sharing predicts future food regurgitation better than relatedness or other non-kin food sharing behaviors, such as harassment. From the collected data, we construct a directed network by building an edge from a bat i to another bat j if i fed j at least once. We removed isolated nodes and obtained a network with 19 nodes, 103 edges and reciprocity equal to 0.64. We fix the number of communities K = 2 and analyse the data with the different methods. As for the high-school data, results of edge prediction tasks for model validation confirm that all the models represent the data well, see Table S5 for details. We are now interested in measuring the ability of the models to recover the observed network with the inferred parameters, in particular their ability to recover topological properties such as reciprocity. To this aim, we consider the marginal and the conditional expected values, as in Section IV B. Figure 7 shows the adjacency matrix of the data and its different estimates, obtained by each method. The network embodies a core-periphery structure, where the main core (labels 0-9) is made of female bats. JointCRep recovers this structure more accurately than the other methods, as the off-diagonal entries clearly show, while the other methods overestimate the number of edges.
Similarly to what we observed in the high-school network, our model is not only more accurate, but also assigns higher probabilities to these entries, and it performs best in terms of both the Log and the L1 Loss, see Table S6. In addition to the marginal and conditional expected values, we can consider the joint distribution to estimate the entries of the adjacency matrix. This is equivalent to assigning a value to each pair (A ij , A ji ) from the set {(0, 0), (0, 1), (1, 0), (1, 1)}, which transforms the edge prediction task into a classification problem. We predict the label associated with the highest probability among [p 00 , p 01 , p 10 , p 11 ], where these are computed by using Equations (S1)-(S4) with the inferred parameters. We assess performance by computing the precision and recall of the predicted labels versus the true labels, as shown in Figure 8. The precision identifies the proportion of correctly classified observed entries. The figure illustrates consistently high precision values across edge labels, as the highest entries lie along the diagonal. In particular, JointCRep is able to correctly classify the pairs (0, 0) and (1, 1). When our model misclassifies, it mainly does so by predicting no edges, i.e., assigning the label (0, 0), when the true ones are either (0, 1) or (1, 0), implying a tendency to estimate sparser networks. On the other hand, the recall indicates the proportion of predicted edges that are correctly classified. Also in this case, the highest entries are on the main diagonal, in correspondence with the pairs (0, 0) and (1, 1). Overall, we obtain higher intensities than for the precision, indicating a tendency to label the predicted edges with the right type.
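A sketch of this classification step, predicting for each node pair the label with the highest joint probability and scoring per-label precision and recall (standard definitions are used here; the convention adopted in the manuscript may aggregate differently):

```python
import numpy as np

def classify_pairs(P_joint, true_pairs):
    """For each node pair, predict the label in {(0,0),(0,1),(1,0),(1,1)}
    with the highest joint probability [p00, p01, p10, p11], then compute
    per-label precision and recall against the true pair labels."""
    labels = [(0, 0), (0, 1), (1, 0), (1, 1)]
    pred = [labels[int(np.argmax(p))] for p in P_joint]
    prec, rec = {}, {}
    for lab in labels:
        tp = sum(p == t == lab for p, t in zip(pred, true_pairs))
        n_pred = sum(p == lab for p in pred)
        n_true = sum(t == lab for t in true_pairs)
        prec[lab] = tp / n_pred if n_pred else 0.0
        rec[lab] = tp / n_true if n_true else 0.0
    return prec, rec
```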
To conclude our analysis, we compare five random samples generated with the inferred parameters of each model and check whether they reproduce topological properties similar to those observed in the real data. Table II shows that JointCRep outperforms the other models in terms of all topological properties. In particular, it generates sampled networks with reciprocity values closest to the real data, and it also reproduces realistic values of the clustering coefficient.

V. Discussion and conclusion
In this paper, we have presented a generative model called JointCRep that takes into account community structure and reciprocity by specifying a closed-form joint distribution of a pair of network edges, without relying upon approximations. Our model also provides closed-form analytical expressions for both the marginal and conditional distributions, and enables practitioners to address with more accuracy questions that were not fully captured by standard models. We first validated our model by applying it to synthetic network datasets, where we achieved remarkable performance in recovering communities, edge prediction tasks, and generating synthetic networks that replicate topological features observed in real networks. We then analyzed two real datasets that are relevant for social scientists and behavioral ecologists, where we found that JointCRep obtains more robust and interpretable results. The results shown in this work highlight the main benefits of using a model that considers closed-form joint distributions of pairs of edges in networks, while also showing possible shortcomings of other approaches. While it is difficult to pinpoint theoretical reasons for these shortcomings, the variety of experiments discussed throughout this manuscript shows their possible practical consequences. In particular, standard generative models make strong conditional independence assumptions that result in poor recovery of topological properties such as reciprocity. On the other hand, models that specify only conditional distributions rely on pseudo-likelihood approximations that may result in weak recovery of latent parameters, such as communities, in certain regimes. Collectively, our model is able to overcome the limitations of both these approaches thanks to the modeling of closed-form joint distributions, while also keeping the computational complexity under control.
The framework we described could be extended in a number of ways. JointCRep analyses binary and single-layer networks; therefore, possible extensions could account for weighted and possibly multilayer networks, where we have edges of different types. Another approach could consider dynamic networks, which have structure evolving over time, and adapt the parameters accordingly [28]. Moreover, our model captures reciprocity through a unique pair-interaction parameter for the whole network. This could be improved in the future by including node-dependent parameters in scenarios where reciprocity varies between individuals. Furthermore, many real-world datasets contain attributes that provide additional information about their features. Incorporating this extra information on nodes could result in a more realistic analysis [4].
JointCRep, to the best of our knowledge, is the first such method for fully capturing reciprocity by jointly modeling pairs of edges with exact 2-edge joint distributions. We believe it will serve as a baseline for future models that tackle more complicated interactions that go beyond pairwise interaction, e.g., triadic closure [24].
Supporting Information (SI)

A. Detailed derivations
Combining Equations (2)-(5) we get the explicit mapping between the latent variables and the instances of the joint probability in Equation (1), together with the corresponding normalization constant. One property of the bivariate Bernoulli is that both the marginal and conditional distributions are univariate Bernoulli. Thus, the marginal distributions of A ij and A ji , as well as the conditional distributions, have Bernoulli densities (Equations up to (S9)). In addition to the expected values reported in the manuscript, we can also compute the variances and the covariance between the random variables (Equations up to (S12)). At each step of the EM algorithm one updates ρ using Equation (11) (E-step) and then maximizes L(ρ, Θ) with respect to Θ = (u, v, w, η) by setting partial derivatives to zero (M-step). Setting the derivative with respect to η to zero leads to the update in Equation (S14); similarly, we get the updates for u, v and w (Equations up to (S17)).

B. Benchmark generative model

The model we propose in the manuscript is able to generate synthetic data with intrinsic community structure and a given reciprocity value. It takes as input a set of membership vectors u i and v i , an affinity matrix w, and a pair-interaction parameter η; the output is a directed and binary network with adjacency matrix A whose pairs of edges are conditionally independent of each other. We use the same formulation as in Safdari et al. [27], but our approach differs in that the edges between a given pair of nodes are generated stochastically according to the joint probability in Equation (1), and not according to a two-step sampling procedure. In detail, we assign a value to each pair (A ij , A ji ) by considering the vector of cumulative probabilities built using Equations (S1)-(S4). To enforce sparsity, we multiply λ by a constant ζ; in order to automatically rescale the expected value in Equation (7), we impose a constraint on the expected number of edges E [M ], a quantity given in input, and solve it with respect to ζ.
The benchmark we propose here differs from the one presented in Safdari et al. [27] for multiple reasons, as summarized in Table I. In addition, it is worth mentioning that the competing benchmark in Safdari et al. [27] depends on a variable, cr_ratio = 1 − η, that controls the proportion of edges generated purely by either the community or the reciprocity effect. This implies that, in order to have high reciprocity, we may weaken the impact of the community effect. This does not happen with the model we propose here, as tie formation can be highly influenced by both reciprocity and community structure at the same time, thus providing a more reliable and truthful representation of certain real-world examples.
In the manuscript, we use networks generated with the benchmark proposed above, where we fix N = 1000 nodes, K = 2 overlapping communities, average degree k = 20, and different values of the pair-interaction parameter η such that we obtain networks with reciprocity values r in the interval [0, 0.8]. In detail, we use η ∈ {0.1, 10, 20, 40, 80, 140, 280, 500, 1500} to get r ∈ {0, 0.1, 0.2, . . . , 0.8}, and we generate 10 different samples for each value of η. In addition to the data presented in the manuscript, we also report in Appendix C 3 further results on synthetic data generated by varying the average degree k in the interval [2, 18] while fixing η = 1000 and the other parameters as above. To generate the membership matrices u and v, we first assign equal-size unmixed group memberships and then introduce overlap for 20% of the nodes by drawing those entries from a Dirichlet distribution with parameter α = 0.1. The affinity matrix w is generated using an assortative block structure with main probability p_1 = k K / N and secondary probability p_2 = 0.1 p_1. Thus the latent variables Θ = (u, v, w, η) are fixed. Then, edges are drawn according to the generative model described above.
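The generation of the latent parameters described above can be sketched as follows. This is a minimal illustration of the stated settings (equal-size unmixed memberships, 20% overlap drawn from a Dirichlet with α = 0.1, assortative w with p_2 = 0.1 p_1); the function and argument names are our own, not those of the released code.

```python
import numpy as np

def benchmark_latents(N=1000, K=2, avg_deg=20, overlap=0.2, alpha=0.1, seed=0):
    """Generate membership matrices u, v and an assortative affinity matrix w.

    Assumes K divides N so that groups have equal size.
    """
    rng = np.random.default_rng(seed)
    # Equal-size hard (unmixed) memberships.
    groups = np.repeat(np.arange(K), N // K)
    u = np.zeros((N, K))
    u[np.arange(N), groups] = 1.0
    # Overlap: redraw a fraction of the rows from a Dirichlet(alpha) distribution.
    mixed = rng.choice(N, size=int(overlap * N), replace=False)
    u[mixed] = rng.dirichlet(alpha * np.ones(K), size=len(mixed))
    v = u.copy()  # same in- and out-memberships in this benchmark
    # Assortative block structure: p1 on the diagonal, p2 = 0.1 * p1 off it.
    p1 = avg_deg * K / N
    w = 0.1 * p1 * np.ones((K, K)) + 0.9 * p1 * np.eye(K)
    return u, v, w

u, v, w = benchmark_latents()
```

With these latent variables fixed, each pair of entries of A is then drawn from the joint distribution in Equation (1).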
For the sake of completeness, we also analysed synthetic networks generated with the model proposed in Safdari et al. [27], obtaining similar results and the same conclusions. Furthermore, we investigated the behaviour of the models on networks with more than 2 communities and noticed that the results are not strongly affected by this parameter. We do not report them here for the sake of brevity.
C. Results on synthetic data

Edge prediction
We test edge prediction by using a 5-fold cross-validation routine: we divide the dataset into five equal-size groups and train the model on four of them (training set) to infer the parameters; the fifth group is then used as test set to evaluate the existence of the edges A_ij it contains. By varying which group is used as test set, we get five trials per realization, and the final score is the average over these. To divide the dataset into five folds, we use a symmetric mask, i.e., in each trial the training set contains 80% of the possible entries (A_ij, A_ji). In the manuscript we show the performance of the models in edge prediction when using the marginal and conditional expected values, E[A_ij] and E[A_ij | A_ji] respectively. Here, we measure the AUC, which is equivalent to the area under the receiver operating characteristic (ROC) curve [13]. In addition to these results, we can exploit the full joint distribution of our model to answer questions such as: what is the probability of jointly observing both edges i → j and j → i? This is equivalent to assigning a value to the pair (A_ij, A_ji) from the set {(0, 0), (0, 1), (1, 0), (1, 1)}, which turns the edge prediction task into a classification problem. However, this problem becomes trivial if the model predicts all entries equal to (0, 0): in this case we would get high performance simply because of the high sparsity of the data. For this reason, we compute the accuracy only for entries in the test set that have at least one edge. For those, we predict the label associated with the highest probability among [p_01, p_10, p_11], where these are computed by using Equations (S1)-(S3) with the inferred parameters. We then compute the accuracy between true and predicted labels, where a value equal to 1 means perfect recovery.
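The joint-label accuracy described above can be computed as in the following sketch, which skips the empty pairs (0, 0) and classifies the remaining test pairs by the largest inferred joint probability; the function name and calling convention are illustrative.

```python
import numpy as np

def pair_accuracy(true_pairs, p01, p10, p11):
    """Accuracy of joint-label prediction on test pairs with at least one edge.

    true_pairs: observed (A_ij, A_ji) tuples in the test set;
    p01, p10, p11: arrays of inferred joint probabilities, one entry per pair.
    """
    labels = [(0, 1), (1, 0), (1, 1)]
    probs = np.column_stack([p01, p10, p11])
    correct = total = 0
    for pair, row in zip(true_pairs, probs):
        if pair == (0, 0):
            continue  # skip empty pairs: predicting them is trivial under sparsity
        predicted = labels[int(np.argmax(row))]
        correct += predicted == pair
        total += 1
    return correct / total if total else float("nan")

acc = pair_accuracy([(1, 1), (0, 1), (0, 0)],
                    np.array([0.1, 0.6, 0.2]),
                    np.array([0.2, 0.1, 0.2]),
                    np.array([0.5, 0.1, 0.3]))
print(acc)  # → 1.0: the empty pair is skipped; the other two are classified correctly
```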
As baselines, we use a uniform random probability over the number of possible labels in the training set (RP), and the accuracy obtained by using as prediction the label with the highest relative frequency in the training set (MRF). The results are shown in Figure S1, where we can observe a V-shape. Reciprocity equal to zero (r = 0) means the networks have no reciprocated edges, and the higher its value, the higher the frequency of the label (1, 1). Thus, in the regime 0 ≤ r ≤ 0.5 the performance decreases because the problem becomes more difficult, until reaching the point where the labels have similar relative frequencies (MRF ≈ RP when r = 0.5). In this scenario, JointCRep outperforms the baselines, with a gap that widens as the reciprocity increases. When r > 0.5 the problem becomes easier due to the increasing proportion of the label (1, 1). Here, predicting all entries equal to (1, 1) results in higher performance (MRF). However, this is another trivial situation that should be ignored when analyzing the performance in edge prediction tasks. Results are averages and standard deviations over 10 synthetic networks and over the 5 cross-validation folds; edge prediction performance is measured with accuracy, and as baselines we consider the uniform random probability (RP) and the maximum relative frequency (MRF). Figure S2 shows the performance of each model in reproducing the average degree in sampled networks. While JointCRep and MT recover this feature across the different values of reciprocity, CRep produces samples with a lower average degree than the one given in input as r increases. This happens because, in high-reciprocity settings, CRep produces sampled networks with fewer edges but higher weights. Hence, while the average degree decreases, the weighted average degree better reflects the input feature (not shown here).
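The RP and MRF baselines described above can be computed as follows; this is a minimal sketch with illustrative names.

```python
import numpy as np
from collections import Counter

def baseline_scores(train_labels, test_labels):
    """Accuracy of the two baselines on the joint-label classification task.

    RP: expected accuracy of a uniform random guess over the labels seen in training;
    MRF: accuracy of always predicting the most frequent training label.
    """
    counts = Counter(train_labels)
    rp = 1.0 / len(counts)  # one chance in (number of observed labels)
    mode = counts.most_common(1)[0][0]
    mrf = np.mean([lab == mode for lab in test_labels])
    return rp, mrf
```

For instance, with training labels {(1, 1), (1, 1), (0, 1)} the MRF baseline always predicts (1, 1), while RP scores 1/2 because only two distinct labels occur.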

Analysis on synthetic data with different average degrees
In addition to the results provided in the manuscript, we also analyse synthetic networks with different values of the average degree. Figure S3 shows the performance of the models in community detection and edge prediction tasks, as well as in reproducing topological properties in sampled networks. Similarly to the results in the manuscript, JointCRep follows the behaviour of MT both in terms of CS and marginal AUC, for which performance improves as the average degree increases, as expected for community detection-only methods. On the other hand, CRep is only partially affected by the different values of the average degree, as also shown in [27]. The plots highlight that the strength of CRep is not retrieving communities, but rather its ability to predict missing edges by using the conditional distribution, regardless of the average degree. In Figure S3 we can also notice that, even though JointCRep is affected by the average degree, its conditional AUC improves on the marginal AUCs already when k = 4, and it reaches good values from k = 10. Moreover, JointCRep outperforms the other methods in recovering reciprocity in sampled networks across different values of the average degree. Overall, these results suggest that JointCRep is a valuable tool also for networks with low-to-medium average degree, providing good communities, reasonable edge predictions, and sampled networks with topological features that resemble those of the real data. In addition to the AUC, we use the Log Loss (or Binary Cross-Entropy) and the L1 Loss (or Mean Absolute Error) to measure the performance of the methods in edge prediction and network reconstruction. The Log Loss for binary classification is defined as

$\mathrm{Log\ Loss} = -\frac{1}{M} \sum_{(i,j)} \big[ A_{ij} \log P(A_{ij}) + (1 - A_{ij}) \log\big(1 - P(A_{ij})\big) \big]$ ,

where $A_{ij}$ indicates the entry of the adjacency matrix, $P(A_{ij})$ denotes the probability of the existence of the edge, and M is the total number of evaluated entries. Instead, the L1 Loss is given by

$\mathrm{L1\ Loss} = \frac{1}{M} \sum_{(i,j)} \big| A_{ij} - P(A_{ij}) \big|$ .

For both metrics, lower values indicate better performance and a loss of 0 denotes perfect predictions.
While the Log Loss does not have an upper bound, the L1 Loss is equal to 1 in the worst-case scenario of predicting every existing edge with probability P(A_ij = 1) = 0 and every non-existing edge with probability P(A_ij = 1) = 1. Moreover, they differ in the extent to which they penalize mistakes: the Log Loss is more sensitive to large disagreements between true and predicted values than the L1 Loss. This means that the Log Loss prefers predictions with many mistakes of low magnitude over predictions with fewer but larger mistakes.
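The two penalties can be computed as in the following sketch; averaging over the evaluated entries is an assumption consistent with the worst-case L1 bound of 1 stated above, and the probability clipping is a numerical guard we add to avoid log(0).

```python
import numpy as np

def log_loss(A, P, eps=1e-12):
    """Binary cross-entropy between observed entries A (0/1) and predicted
    edge probabilities P, averaged over the evaluated entries."""
    P = np.clip(P, eps, 1 - eps)  # guard against log(0)
    return -np.mean(A * np.log(P) + (1 - A) * np.log(1 - P))

def l1_loss(A, P):
    """Mean absolute error between observed entries and predicted probabilities."""
    return np.mean(np.abs(A - P))
```

For example, predicting every existing edge with probability 0 and every non-existing edge with probability 1 gives an L1 Loss of exactly 1, while the Log Loss diverges (up to the clipping) in the same scenario, reflecting its harsher penalty for confident mistakes.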
2. Analysis of a high-school social network

Table S1 displays the results for the edge prediction task in the high-school social network. CRep performs the best both in terms of AUC and Log Loss when using the conditional probabilities. On the other hand, JointCRep is the best when considering the L1 Loss. This is explained by the behaviour of our model, which tends to predict fewer edges with higher intensity, unlike the other models, which predict more edges with low-to-medium probabilities. As a remark, the dataset has an average degree k = 6.45 and is highly sparse. This makes the task hard because there is little information in input when considering 5-fold cross-validation splits, and some folds may yield unreliable results. Nevertheless, all models perform reasonably well at this task given this sparse regime. Comparing the communities inferred by the various methods, Table S2 shows the overlapping modularity obtained for the partitions of Figure 4 for various aggregation functions. Table S3 reports the penalties for the network reconstruction task, showing that JointCRep has the best performance in terms of both Log Loss and L1 Loss. Finally, Table S4 shows the topological properties of the high-school social network and its sampled networks, showing that JointCRep achieves on average values that are more similar to those observed in the input data.
TABLE S1: Edge prediction in the high-school social network. Results are averages and standard deviations over the 5 cross-validation folds. Edge prediction performance is measured with three different metrics F: AUC, Log Loss, and L1 Loss. The AUC measures the probability that a randomly selected edge has a higher expected value than a randomly selected non-existing edge, and the baseline is the random value 0.5. The Log Loss and the L1 Loss are penalty measures defined in Appendix D 1 that quantify the difference between two input networks by taking into account the probability of the existence of an edge and computing a penalty for each mistake in predicting the observed value. The metrics are computed by using either the marginal probability P(A_ij | Θ) or the conditional probability P(A_ij | A_ji, Θ). Note that the latter is not available for MT because the conditional and marginal distributions coincide. The best performance for each metric is in bold.

3. Analysis of a vampire bat network

Table S5 shows results for edge prediction tasks using a 5-fold cross-validation routine. Similarly to the high-school dataset, all the models obtain good performance given the sparse regime, although values were slightly better in the high-school case. Also in this case, JointCRep achieves the best results in terms of L1 Loss, while CRep is more robust in terms of Log Loss. Table S6 reports the quantitative results for the network reconstruction of the vampire bat dataset, in terms of Log Loss and L1 Loss. Here, JointCRep outperforms the other methods, as also shown in Figure 7.

TABLE S3: High-school network reconstruction: comparison between true and reconstructed networks. F denotes the function considered in each row, defined in Appendix D 1. The metrics are computed by using either the marginal probability P(A_ij | Θ) or the conditional probability P(A_ij | A_ji, Θ) of each method with the inferred parameters. Note that the latter is not available for MT because the conditional and marginal distributions coincide. The best performance for each metric is in bold.