Sparse Estimation of Huge Networks with a Block-Wise Structure

Networks with a very large number of nodes appear in many application areas and pose challenges for traditional Gaussian graphical modelling approaches. In this paper, we focus on the estimation of a Gaussian graphical model when the dependence between variables has a block-wise structure. We propose penalized likelihood estimation of the inverse covariance matrix, also called Graphical LASSO, applied to block averages of observations, and we derive its asymptotic properties. Monte Carlo experiments, comparing the properties of our estimator with those of the conventional Graphical LASSO, show that the proposed approach works well in the presence of a block-wise dependence structure and that it is also robust to possible model misspecification. We conclude the paper with an empirical study on economic growth and convergence of 1,088 small European regions over the years 1980 to 2012. While requiring a priori information on the block structure – e.g. given by the hierarchical structure of the data – our approach can be adopted for estimation and prediction using very large panel data sets. It is also particularly useful when there is a problem of missing values and outliers or when the focus of the analysis is on out-of-sample prediction.


INTRODUCTION
Estimation of large covariance matrices and their inverses has several applications in various areas, from economics and finance to health, biology, computer science and engineering. One important technique developed in the statistical and computer science literature is the graphical modelling approach, which aims at exploring the relationships among a set of random variables through their joint distribution. Under this framework, the Gaussian distribution is often assumed and, in this case, the dependence structure is completely determined by the covariance matrix or, equivalently, by its inverse, whose off-diagonal elements are proportional to partial correlations (Lauritzen, 1996). Specifically, variables i and j are conditionally independent given all other variables if and only if the (i, j)th element of the inverse covariance matrix, referred to as the precision matrix, is zero. One result in the Gaussian graphical modelling literature is that there is a one-to-one correspondence between the joint Gaussian distribution of a vector of random variables and its conditional Gaussian distribution. Under the latter, the distribution of a variable observed at a certain node, given the values observed at all other nodes, depends only on the observations in its neighbourhood; see, e.g. Mardia (1988) and Meinshausen and Buhlmann (2006). Hence, the problem of estimating the (inverse) covariance matrix is equivalent to a neighbourhood selection problem. This observation has led to efficient node-wise LASSO approaches for sparse high-dimensional graphs; see, e.g. Meinshausen and Buhlmann (2006) and Peng et al. (2009). In contrast to these approaches, Friedman et al. (2008) developed the Graphical LASSO (GLASSO) approach, where the inverse covariance matrix is directly estimated via penalized likelihood.
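The link between zeros in the precision matrix and conditional independence can be illustrated numerically. The following sketch is our own illustration (not taken from the paper): it builds a three-variable precision matrix with θ₁₃ = 0 and checks that variables 1 and 3 are marginally correlated but have zero partial correlation given variable 2.

```python
import numpy as np

# Precision matrix for three variables; theta_13 = 0 encodes that
# variables 1 and 3 are conditionally independent given variable 2
theta = np.array([[2.0, 0.6, 0.0],
                  [0.6, 2.0, 0.6],
                  [0.0, 0.6, 2.0]])
sigma = np.linalg.inv(theta)

# Marginally, variables 1 and 3 are correlated...
marg_corr_13 = sigma[0, 2] / np.sqrt(sigma[0, 0] * sigma[2, 2])

# ...but their partial correlation given variable 2 is exactly zero:
# partial_corr(i, j | rest) = -theta_ij / sqrt(theta_ii * theta_jj)
partial_corr_13 = -theta[0, 2] / np.sqrt(theta[0, 0] * theta[2, 2])

print(round(marg_corr_13, 3), partial_corr_13)
```

The marginal correlation is non-zero because dependence propagates through variable 2, while the zero entry in the precision matrix yields a zero partial correlation.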
Conditional Gaussian models are known in the spatial econometrics literature as conditional autoregressive (CAR) models, representing data from a given spatial location as a function of data in neighbouring locations; see, e.g. Cressie (1993) and Anselin (2010). In a CAR model, the neighbourhood structure is represented by means of the so-called spatial weights matrix, usually assumed to be known a priori using information on distance between units, such as the geographic, economic, policy or social distance. It is interesting to observe that the problem of estimating the spatial weights matrix in a CAR model is equivalent to a neighbourhood selection problem in a graphical model; for more details, see Section 5. Hence, the spatial weights matrix for CAR models can be estimated by using methods from the Gaussian graphical modelling literature for estimating inverse covariance matrices. While the spatial econometrics literature has been largely immune to the developments in Gaussian graphical modelling, these methods may be useful for a large number of applications in the social sciences.
In this paper, we consider the case of networks with a very large number of nodes and we focus on the estimation of Gaussian graphical models when the dependence between variables has a block-wise structure. We assume that units can be split into a set of non-overlapping groups, or blocks, in such a way that the dependence between units only varies across blocks, instead of individual observations. Hence, rather than estimating the links between each pair of units in the sample, we propose to estimate the dependence (links) between groups of cross-sectional units. Our approach consists of applying the GLASSO methodology of Friedman et al. (2008) to block-level averages of observations rather than to single observations. When the size of the group is unity, our method collapses to the conventional GLASSO. A major advantage of this method is that its computational cost is greatly reduced and hence it can be adopted for estimation and prediction using very large, or huge, networks. Our approach is also particularly useful when there is a problem of missing values and outliers or when the focus of the analysis is out-of-sample prediction.
There exist several examples where it is reasonable to assume a block-wise dependence structure between units. In economics, preferences for consumer goods of individuals belonging to the same household may react similarly in response to consumption decisions of neighbouring households. Companies belonging to the same sector of economic activity and located within the same geographical area (e.g. the postcode, the region or the country) tend to behave similarly because they have similar characteristics or face similar opportunities and constraints. Thus, it is reasonable to assume that the way they interact with companies from other sectors and/or geographical areas is similar. A block-wise dependence structure is also a realistic assumption when the variable of interest displays an explicit hierarchical or group membership structure, namely, clustering of units in an organized fashion, such as students within classrooms, members of a household, General Practitioners in a clinic, etc. This is common, for example, when dealing with large, individual-level, microeconomic or health data sets. Other examples are in neuroscience, where the networks used to represent brain activity have a hierarchical structure, with billions of neurons connected to each other through hub nodes, called voxels, and with connected voxels forming areas that are again connected with each other (Luo, 2015). In biology, regulatory networks are thought to have a hub-type structure, with groups of genes having a similar dependency structure and regulated by a small number of unobserved proteins (Hao et al., 2012). When the grouping is not fully known a priori, we could use methods that allow us to determine endogenously the optimal grouping of cross-sectional units, by employing techniques from the clustering literature; see, e.g. Lin and Ng (2012), Bonhomme and Manresa (2015) and Ando and Bai (2016).
Exploitation of a priori information on the group structure of variables is not new in the social interaction literature and in the statistical and graphical modelling literature. Empirical works from the social interaction literature typically assume that an individual reacts to the average of others in a predefined group; see Durlauf and Young (2001) and Blume et al. (2013) for a review. Such an assumption implies that the spatial weights matrix has a group-membership structure, where the weights are identical for all units belonging to the same group, while they are set to zero for the interaction between units belonging to different groups. Lee and Yu (2007) considered the identification and estimation of interaction effects in the context of a spatial autoregressive model where the spatial weights matrix (and the associated precision matrix) has such a block diagonal structure with equal entries. Note that this is a more restrictive assumption than that used in this paper, as it does not allow for dependence between groups. Nevertheless, this model has been widely adopted in several different areas of the social sciences, such as education (Calvó-Armengol et al., 2009), labour market outcomes (Bayer et al., 2008), crime (Sirakaya, 2006) and welfare participation (Bertrand et al., 2000). Similar models have been proposed in the statistical literature, where mixed effect models are commonly used to represent variables with a hierarchical or known group membership structure (Goldstein, 2011). When the random effects are assumed to be correlated, these models lead to a covariance matrix that has a block-wise structure of the same type that we use in this paper, with equal correlation within groups and equal correlation between any two elements of two specified groups (Laird and Ware, 1982). Maximum likelihood approaches are typically used for parameter estimation in these models.
In the case of a large number of regressors, penalized approaches based on the L1 penalty are used for estimation and variable selection (Schelldorfer et al., 2014). However, these methods typically require a small number of random effects (blocks).
A number of authors in the literature on graphical modelling have proposed sparse estimation of graphs with a block structure. These methods exploit a priori information on group membership of observations to propose fast, sparse estimation algorithms. Guo et al. (2011) consider a heterogeneous data set where variables, while independent across groups, have a sparse dependency structure within groups. The corresponding precision matrix has a block diagonal structure, and the authors propose joint estimation of the various blocks by maximizing the corresponding penalized log-likelihood functions. A similar approach is taken by Mazumder and Hastie (2012), who propose thresholding estimation of a sparse inverse covariance that is a block diagonal matrix of connected components. Wit and Abbruzzo (2015) impose block equality constraints on the parameters of an undirected graphical model to reduce the number of parameters to be estimated. Vinciotti et al. (2016) discuss various forms of block structures for dynamic networks and propose estimation of the associated precision matrix under sparsity and equality constraints on parameters (also known as parameter tying). The inclusion of equality constraints, while reducing the number of parameters, often increases the computational complexity of the estimation procedures. For example, the general block structures considered by Wit and Abbruzzo (2015) and Vinciotti et al. (2016) imply a computational cost of the estimation procedure that is higher compared to the approaches of Guo et al. (2011) and Mazumder and Hastie (2012), where the assumed block structure allows the large GLASSO problem to be split into many smaller, tractable problems.
In this paper, we use block structures with the intent of achieving computational efficiency, allowing us to infer networks of very large dimensions. Unlike Guo et al. (2011) and Mazumder and Hastie (2012), our approach does not need to impose block-diagonality of the precision matrix. However, we assume that units can be split into groups in such a way that the covariance matrix (and the associated precision matrix) only varies across blocks, rather than individual observations. The rest of the paper is structured as follows. In Section 2, we describe the main features of our graphical model with a block-wise dependence structure, while in Section 3 we propose our estimator based on the GLASSO. In Section 4, we run Monte Carlo experiments to investigate the small-sample properties of the proposed estimator. In Section 5, we carry out an empirical study on the economic growth of a set of small regions in Europe. Finally, in Section 6, we provide some concluding remarks. The Appendix provides the proofs.
We use |λ_1(A)| ≥ |λ_2(A)| ≥ ··· ≥ |λ_n(A)| to denote the eigenvalues of a matrix A ∈ M_{n×n}, where M_{n×n} is the space of n × n matrices. Tr(A) is the trace of A ∈ M_{n×n}, while its Frobenius norm is ‖A‖_F = (∑_{i,j=1}^{n} a_ij²)^{1/2}. K is used for a fixed positive constant that does not depend on N; S^c is used to denote the complement of a set S.

BLOCK-WISE DEPENDENCE STRUCTURE IN HUGE NETWORKS
Let y_it be the observed data for the ith individual, i = 1, 2, ..., N, at time t, with t = 1, 2, ..., T, and assume that the N-dimensional vector y_t = (y_1t, y_2t, ..., y_Nt)' ∼ N(μ, Σ), where Σ is an N × N symmetric and positive definite matrix, independent of t. For ease of exposition, we set μ = 0, although this assumption can be relaxed by setting μ to a non-zero vector depending on a set of strictly or weakly exogenous regressors, including, for example, temporal lags of the dependent variable. Assume that the variables can be split into G non-overlapping groups, with G ≤ N, such that the dependence between individuals belonging to different groups is the same for all individuals belonging to the same group. Suppose, for simplicity, that all groups are of the same size M = N/G, where M is an integer number. Under this assumption, Σ has a block-wise structure: the (g, h)th off-diagonal block is σ_gh 1_M 1_M', where 1_M is the M-dimensional vector of ones, while the gth diagonal block has off-diagonal elements equal to σ_gg and diagonal elements equal to δ_g, where the σ_gg are intra-group covariances and the δ_g are group-specific variances, for g = 1, 2, ..., G. Let

Γ_G = diag(γ_1, γ_2, ..., γ_G),   (2.4)

where γ_g = δ_g − σ_gg ≥ 0. Then, Σ can be written in compact form as

Σ = Σ_G ⊗ 1_M 1_M' + Γ_G ⊗ I_M,   (2.5)

where Σ_G = (σ_gh) is a G × G matrix assumed to be positive definite. If Σ has the above block-wise structure, then its inverse, namely the precision matrix Θ = Σ^{−1}, is also block-wise. To show this, rewrite (2.5) as

Σ = (MΣ_G + Γ_G) ⊗ (1/M)1_M 1_M' + Γ_G ⊗ (I_M − (1/M)1_M 1_M').

Noting that (1/M)1_M 1_M' and (I_M − (1/M)1_M 1_M') are idempotent matrices such that their sum is the identity matrix, we can apply Lemma 2.1 (point (iv)) in Magnus (1982) to obtain

Θ = (1/M)(Σ_G + (1/M)Γ_G)^{−1} ⊗ (1/M)1_M 1_M' + Γ_G^{−1} ⊗ (I_M − (1/M)1_M 1_M').   (2.6)

Assuming that the matrix (Σ_G + (1/M)Γ_G)^{−1} has generic elements φ_gh, the likelihood function has a simplified expression that depends on the data only through the group averages and the within-group deviations; see Appendix A for a proof. Below, we propose a penalized maximum likelihood approach to estimate Σ and Θ, which exploits the block-wise dependence structure and is based on the GLASSO.
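The compact form (2.5) and the inverse (2.6) can be verified numerically. The following sketch is our own check, with arbitrary illustrative values for Σ_G and the γ_g; it builds Σ via Kronecker products and confirms that the formula based on the two orthogonal idempotents reproduces the direct inverse.

```python
import numpy as np

G, M = 3, 4
N = G * M
rng = np.random.default_rng(0)

# A positive definite G x G between-group matrix Sigma_G and
# group-specific gamma_g (arbitrary illustrative values)
A = rng.standard_normal((G, G))
sigma_G = A @ A.T + G * np.eye(G)
gamma_G = np.diag(rng.uniform(0.2, 0.5, G))

J_M = np.ones((M, M))            # 1_M 1_M', the M x M matrix of ones
sigma = np.kron(sigma_G, J_M) + np.kron(gamma_G, np.eye(M))   # (2.5)

# Precision matrix via (2.6), using the two orthogonal idempotents
P1 = J_M / M
P2 = np.eye(M) - P1
theta = np.kron(np.linalg.inv(M * sigma_G + gamma_G), P1) \
      + np.kron(np.linalg.inv(gamma_G), P2)

# (2.6) agrees with the direct inverse; Theta is itself block-wise
print(np.allclose(theta, np.linalg.inv(sigma)))
```

Note that inv(MΣ_G + Γ_G) equals (1/M)(Σ_G + (1/M)Γ_G)⁻¹, so the code matches (2.6) term by term.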

BLOCK-GLASSO APPROACH
To propose our estimator, consider the group averages

ȳ_gt = (1/M) ∑_{i∈g} y_it, g = 1, 2, ..., G,   (3.1)

and note that, if y_t ∼ N(0, Σ), where Σ is given by (2.5), then also y_G,t = (ȳ_1t, ȳ_2t, ..., ȳ_Gt)' ∼ N(0, Σ_G + (1/M)Γ_G), or, in matrix form, y_G,t = (I_G ⊗ (1/M)1_M')y_t. It follows that we can estimate Σ_G by applying the GLASSO to the vector of group means, y_G,t. More specifically, consider the following two-step procedure.

STEP 1. Estimate Θ_G = (Σ_G + (1/M)Γ_G)^{−1} by applying the GLASSO to the group averages y_G,t.

STEP 2. Estimate γ_g by exploiting identities (2.4) and (3.4). Noting that E[y_it²] = σ_gg + γ_g and E[ȳ_gt²] = σ_gg + (1/M)γ_g, we can consider the following estimator for γ_g:

γ̂_g = (M/(M − 1)) [ (1/(TM)) ∑_{i∈g} ∑_{t=1}^{T} y_it² − (1/T) ∑_{t=1}^{T} ȳ_gt² ].

Hence, use (2.6) to recover Θ̂:

Θ̂ = (1/M) Θ̂_G ⊗ (1/M)1_M 1_M' + Γ̂_G^{−1} ⊗ (I_M − (1/M)1_M 1_M'),   (3.6)

where Γ̂_G = diag(γ̂_1, ..., γ̂_G). In Step 1, the estimator that maximizes the penalized likelihood for y_G,t is

Θ̂_G = argmax_{Θ_G ≻ 0} { ln det(Θ_G) − Tr(S_G Θ_G) − ρ_G ∑_{g≠h} |θ_gh,G| },   (3.7)

where the maximization is taken over symmetric positive definite matrices, S_G is the sample covariance matrix of y_G,t, and ρ_G is the tuning parameter controlling the degree of sparsity in the estimated inverse covariance matrix.
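The two steps can be sketched as follows. This is our own illustration: it simulates from the block model, computes the Step 1 input (the sample covariance of the group averages, on which a GLASSO fit such as sklearn.covariance.graphical_lasso would be run; that fit is omitted here), and applies a moment-based estimator for γ_g that is our reading of the identities in (2.4) and (3.4), not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
G, M, T = 3, 5, 20000
N = G * M

# A block-wise Sigma as in (2.5); Sigma_G and gamma values are
# arbitrary illustrative choices, not the paper's design
A = rng.standard_normal((G, G))
sigma_G = A @ A.T / G + np.eye(G)
gamma = rng.uniform(0.2, 0.5, G)
sigma = np.kron(sigma_G, np.ones((M, M))) + np.kron(np.diag(gamma), np.eye(M))

# Draw T observations y_t ~ N(0, Sigma), stored as an N x T array
y = np.linalg.cholesky(sigma) @ rng.standard_normal((N, T))

# STEP 1 input: group averages (G x T) and their sample covariance,
# which estimates Sigma_G + Gamma_G / M; the sparse GLASSO fit on S_G
# is omitted in this sketch
y_G = y.reshape(G, M, T).mean(axis=1)
S_G = y_G @ y_G.T / T

# STEP 2: moment-based estimator of gamma_g, using
# E[y_it^2] = sigma_gg + gamma_g and E[ybar_gt^2] = sigma_gg + gamma_g/M
second_moment_units = (y**2).reshape(G, M, T).mean(axis=(1, 2))
second_moment_means = (y_G**2).mean(axis=1)
gamma_hat = M / (M - 1) * (second_moment_units - second_moment_means)
print(np.round(gamma_hat, 2))
```

With T large, γ̂_g settles near the true γ_g and S_G near Σ_G + (1/M)Γ_G, which is what Steps 1 and 2 rely on.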
The following theorems derive the asymptotic properties of estimator (3.6) when both N and T go to infinity.
THEOREM 3.1. (Consistency) Let Θ̂ be an estimate of Θ obtained following Steps 1 and 2, where ρ_G = O(√(ln G/T)), with ρ_G being the tuning parameter in (3.7). Then, denoting by s_G the number of non-zero off-diagonal elements of Θ_G, we have

‖Θ̂_G − Θ_G‖_F = O_p(√((G + s_G) ln G/T)).

THEOREM 3.2. (Sparsistency) Let S = {(i, j) : θ_ij ≠ 0, i ≠ j} be the set of indices of all non-zero off-diagonal elements in Θ. Then, with probability tending to 1, we have θ̂_ij = 0 for all (i, j) ∈ S^c.
Theorem 3.2 is a straightforward consequence of the sparsistency theorem of Lam and Fan (2009) applied to Θ_G; see also Rothman et al. (2008) and Guo et al. (2011).
Hence, for Θ̂ to be a good proxy of Θ, G needs to be small (or, equivalently, M large) and Θ_G needs to be a sparse matrix, as measured by s_G. Note, however, that, from (2.6), the off-diagonal elements of Θ are proportional to 1/M². Hence, for fixed G, as M increases the (relative) effect of each individual neighbour on each unit would disappear and, in the limit, the precision matrix would become a diagonal matrix. A similar result has been obtained by Lee (2002) in the context of a spatial autoregressive (SAR) model where each spatial unit is influenced aggregately by a significant portion of the other spatial units in the sample. Lee (2002) showed that if each spatial unit in the limit has infinitely many neighbours (which would happen in our case for G fixed and M increasing), then the ordinary least-squares (OLS) estimator for a SAR model would still be consistent and even asymptotically efficient. In Section 4, we investigate the properties of our estimator for different values of G relative to N.
A major advantage of our proposed estimation procedure is that it is considerably faster than the conventional GLASSO for estimating an N × N precision matrix. Using the algorithm proposed by Friedman et al. (2008), the computational cost associated with a coordinate descent update decreases from O(N²) to O(G²). This could decrease further to O(G) using faster algorithms, such as QUIC (Hsieh et al., 2014). Another advantage of our approach is that using block averages rather than single observations greatly helps in the presence of missing values, a common problem in statistical analysis. Exploiting group membership information is also very useful for prediction purposes on a hold-out sample of units, for which the position in the (individual-level) network is usually unknown. It is important, however, to remark that our approach requires a priori information on the block structure. If this is not available, then one could exploit methods from the clustering literature that allow us to determine endogenously the optimal grouping of cross-sectional units, such as the k-means algorithm (Forgy, 1965) extended to allow for covariates in the model; see, in particular, Lin and Ng (2012) and Bonhomme and Manresa (2015), and also Ando and Bai (2016). Our approach also has potential applications in the area of spatial econometrics. Given the equivalence between CAR models and the joint Gaussian distribution emphasized by many authors (see, among others, Mardia, 1988, and Meinshausen and Buhlmann, 2006), this method provides a means for estimating spatial weights matrices in the context of very large panel data. Later in the paper, we offer a small empirical exercise using CAR models.
Finally, it is important to remark that our approach does not allow us to estimate consistently the precision matrix when this arises from one or more common, pervasive factors. Unobserved common factors occur in time series as a result of global shocks, namely unexpected events that may hit all statistical units, although with different intensities (Stock and Watson, 2010). These large-scale perturbations affect micro-level population units and are often responsible for observable co-movements of a large number of time series. We observe that our model is more parsimonious than the common factor specification and may be useful in situations where T is too short to allow for fully unrestricted common effects. However, in a large T setting, in the presence of unobserved common factors, our approach can be applied to de-factored residuals, after estimating common factors using methods such as principal components (Bai, 2003) or the Common Correlated Effects methodology (Pesaran, 2006).

Case of blocks with unequal size
Suppose now that we have blocks of unequal size, so that group g has size M_g, with g = 1, 2, ..., G, and let M_max = max_g M_g. In this case, the group averages in (3.1) are based on M_g observations. By applying recursively the theorem for block matrix inversion (see Bernstein, 2005), it is easy to see that, in the case of blocks of unequal size, a block-wise structure for Σ still implies a block-wise Θ. In this case, a convenient representation of Σ can be obtained using selection matrices. Let Σ_max be the GM_max × GM_max covariance matrix (3.9) built as in (2.5), with all blocks padded to size M_max. Then Σ can be extracted as follows:

Σ = S Σ_max S',

where S is an N × GM_max matrix of zeros and ones, selecting the correct number of rows and columns for each block in Σ_max, depending on the group size. Note that SS' = I_N, and that S'S is a diagonal GM_max-dimensional matrix of zeros and ones. 1 Applying the matrix inversion lemma to this representation, with I_GM_max the GM_max × GM_max identity matrix, yields expressions (3.12) and (3.13) for the precision matrix. Steps 1 and 2 outlined above can still be carried out to obtain Σ̂_G and Γ̂_G, where now TM_g observations are used to calculate γ̂_g. The resulting estimates can then be plugged into (3.12) and (3.13). From (3.12) and (3.13), it can be seen that consistency and sparsistency of the resulting estimator continue to hold, with rates that now depend on N, G and M_max.
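The selection-matrix representation can be checked on a toy example. The sketch below is our own construction (padding each block to M_max and keeping the first M_g positions of each group); it verifies SS' = I_N, the 0/1 diagonal structure of S'S, and the extraction Σ = SΣ_max S'.

```python
import numpy as np

sizes = [2, 3]                   # group sizes M_g
G, M_max, N = len(sizes), max(sizes), sum(sizes)

# Selection matrix S (N x G*M_max): for each group, keep the first M_g
# of its M_max padded positions (our construction of S)
S = np.zeros((N, G * M_max))
row = 0
for g, M_g in enumerate(sizes):
    for m in range(M_g):
        S[row, g * M_max + m] = 1.0
        row += 1

# Padded covariance built as in (2.5) with every block of size M_max
sigma_G = np.array([[2.0, 0.5],
                    [0.5, 2.0]])
gamma = np.array([0.3, 0.4])
sigma_max = np.kron(sigma_G, np.ones((M_max, M_max))) \
          + np.kron(np.diag(gamma), np.eye(M_max))

sigma = S @ sigma_max @ S.T      # extract the unequal-size covariance

print(np.allclose(S @ S.T, np.eye(N)))      # S S' = I_N
```

The extracted Σ keeps the block-wise pattern: within-group covariances σ_gg, between-group covariances σ_gh, and diagonal entries σ_gg + γ_g.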

Allowing for general intra-block correlation structure
The approach outlined in Section 3 can be extended to allow for a general intra-block correlation matrix, at the expense of reduced computational efficiency. Suppose that, for all i, j ∈ g = 1, 2, ..., G,

σ_ij = σ_gg + π_ij.   (3.14)

Under this framework, while the covariance between variables of different blocks is constant for all variables belonging to the same block, the intra-block covariance is allowed to vary across variables. In this case, the covariance matrix can be written as

Σ = Σ_G ⊗ 1_M 1_M' + Γ_G ⊗ I_M + Π,

where Π is a block-diagonal matrix collecting the deviations π_ij. We can show that, under the condition that (1/M) ∑_{k∈g} π_ik ≈ 0 for all i, the matrix Π can be estimated from the covariance matrix of y_it − ȳ_gt for each block. This results in a relatively easy implementation, whereby we first calculate ȳ_gt and apply the block-GLASSO outlined in Section 3 to compute Θ̂_G. Then, we calculate the deviations of each value y_it from its corresponding group-level average, namely y_it − ȳ_gt, and we apply the conventional GLASSO to all y_it − ȳ_gt for each block, separately. This approach requires that the π_ij, namely the deviations of σ_ij from σ_gg, are not too large, so that the ȳ_gt can be used to consistently estimate σ_gg. The computational complexity of this procedure rises to O(G²) + O(GM²), as it is necessary to estimate G blocks of size M. In the rest of the paper, we refer to this approach as the flexible block-GLASSO.
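The second stage of the flexible procedure can be sketched numerically (our own illustration): with no intra-block variation (all π_ij = 0), the covariance of the within-group deviations y_it − ȳ_gt reduces to γ_g(I_M − (1/M)1_M 1_M'), which the per-block sample covariance recovers.

```python
import numpy as np

rng = np.random.default_rng(2)
G, M, T = 2, 4, 50000
N = G * M

# Block-wise Sigma with no intra-block variation (all pi_ij = 0);
# the numerical values are arbitrary illustrative choices
sigma_G = np.array([[1.5, 0.4],
                    [0.4, 1.5]])
gamma = np.array([0.3, 0.5])
sigma = np.kron(sigma_G, np.ones((M, M))) + np.kron(np.diag(gamma), np.eye(M))
y = np.linalg.cholesky(sigma) @ rng.standard_normal((N, T))

# Second stage of the flexible block-GLASSO: within-group deviations
y_grp = y.reshape(G, M, T)
dev = y_grp - y_grp.mean(axis=1, keepdims=True)      # y_it - ybar_gt

# Per-block sample covariance of the deviations; a sparse intra-block
# estimate would be obtained by running the conventional GLASSO on each
# of these G small matrices (that fit is omitted here)
block_cov = [dev[g] @ dev[g].T / T for g in range(G)]

# With pi_ij = 0 this covariance equals gamma_g * (I_M - (1/M) 1_M 1_M')
target0 = gamma[0] * (np.eye(M) - np.ones((M, M)) / M)
print(np.round(block_cov[0] - target0, 2))
```

When the π_ij are non-zero but small, the same per-block covariances pick up the intra-block deviations, which is the quantity the flexible procedure feeds to the GLASSO.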

MONTE CARLO EXPERIMENTS
In this section, we provide Monte Carlo evidence on the properties of the above estimation procedure. We consider the following data-generating process:

y_it = β x_it + e_it, i = 1, 2, ..., N; t = 1, 2, ..., T,   (4.1)

where x_it is generated by a first-order autoregressive process, (4.2). In generating x_it, we set x_i,−20 = 0 and discard the first 20 observations, to reduce the effect of the initial values of x_it on the estimates. To generate Θ, we start from Θ_G = Σ_G^{−1} and assume that its elements are distributed as θ_gh,G ∼ Bin(1, (3/G)), for g, h = 1, ..., G. We obtain Σ and Θ by applying (2.6), where we assume γ_g ∼ U(0.2, 0.5). Letting D be the Choleski decomposition of Σ, namely Σ = DD', we generate e_t = Dε_t, where ε_t = (ε_1t, ε_2t, ..., ε_Nt)', with ε_it ∼ IIDN(0, 1). We generate X following the same procedure. As for β, in a first set of experiments we set β = 0 and apply our methodology directly to y_it, to test our procedure when there is no uncertainty regarding the mean of y_it. We then set β = 1 and apply our methodology to regression residuals after estimating β by OLS. As a robustness check, we carry out an additional experiment where the errors are non-normally distributed. In this case, when generating e_it, we set ε_it = (u_it − 1)/√2, with u_it ∼ χ²_1. Model (4.1)-(4.2) has strictly exogenous regressors, an assumption that may not hold in practice. In a further set of experiments, we therefore also consider a dynamic set-up, where we assume that y_it is generated by the first-order autoregressive model

y_it = λ y_i,t−1 + β x_it + e_it, i = 1, 2, ..., N; t = 1, 2, ..., T,   (4.3)

where all elements are generated as above, and λ = 0.4. Finally, we examine the performance of the more general flexible block-GLASSO approach outlined in Section 3.2 when Σ has a general intra-block correlation structure. In this experiment, all parameters are the same as in (4.1) and (4.2), with β = 1 and, for all i, j ∈ g = 1, 2, ..., G, σ_ij generated with intra-block variation as in (3.14) ((4.4) and (4.5)). We generate each block in Π by assuming that its inverse has elements distributed as Bin(1, (3/M)).
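The data-generating process can be sketched as follows. This is our own implementation: the AR(1) coefficient for x_it and the diagonal-dominance device used to make the simulated Θ_G positive definite are our assumptions, since (4.2) and the exact construction are not reproduced above.

```python
import numpy as np

rng = np.random.default_rng(3)
G, M, T = 10, 5, 30
N = G * M

# Sparse G x G precision Theta_G with off-diagonal support ~ Bin(1, 3/G);
# the -0.2 edge weight and the diagonal-dominance step guaranteeing
# positive definiteness are our own devices
support = np.triu(rng.binomial(1, 3 / G, (G, G)), k=1)
theta_G = -0.2 * (support + support.T)
np.fill_diagonal(theta_G, np.abs(theta_G).sum(axis=1) + 1.0)
sigma_G = np.linalg.inv(theta_G)

# Sigma from the block-wise structure (2.5), with gamma_g ~ U(0.2, 0.5)
gamma = rng.uniform(0.2, 0.5, G)
sigma = np.kron(sigma_G, np.ones((M, M))) + np.kron(np.diag(gamma), np.eye(M))

# e_t = D eps_t, with Sigma = D D' (Choleski) and eps_it ~ IIDN(0, 1)
e = np.linalg.cholesky(sigma) @ rng.standard_normal((N, T))

# y_it = beta * x_it + e_it, with an AR(1) regressor as in (4.1)-(4.2);
# the AR coefficient 0.5 is an assumption for illustration only
beta = 1.0
x = np.zeros((N, T))
v = rng.standard_normal((N, T))
for t in range(1, T):
    x[:, t] = 0.5 * x[:, t - 1] + v[:, t]
y = beta * x + e
print(y.shape)
```

In the experiments described below, the block-GLASSO and the conventional GLASSO would then be fitted to y (or to OLS residuals when β ≠ 0).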
In each experiment, we compute the block-GLASSO and the conventional GLASSO for all pairs of N and T with N = 50 and 100, and T = 10, 50 and 200. As for the choice of G, we try G = N/2 and G = N/5. Each experiment is replicated R = 250 times. We also carry out another set of experiments with N much larger than T, setting N = 500, 1,000 and 2,000 and T = 20. In this set of experiments, given the computational difficulties and poor performance of the conventional GLASSO for such large networks, we only provide results for the block-GLASSO. Under the dynamic set-up (4.3), we only run experiments for large T (i.e. T = 50 and 200), to avoid incurring the bias of the OLS estimator for short panels. 2 A number of statistics are used to assess the performance of our graph estimators. In terms of recovery of the network structure (provided by the non-zero coefficients in Θ), we consider the receiver operating characteristic (ROC) curve, which plots the true positive rate (percentage of non-zeros, i.e. links, correctly estimated as non-zero) versus the false positive rate (percentage of zeros incorrectly estimated as non-zero), as the tuning parameter, ρ_G, varies. We summarize ROC curves by providing the maximum F1 score and the area under the curve (AUC), both averaged across the R replications. The F1 score is defined by 2TP/(2TP + FN + FP), with TP, FP and FN being the number of true positives, false positives and false negatives (non-zeros incorrectly detected as zeros), respectively. In terms of estimation of the precision matrix, we report the average entropy loss (EL) and the average Frobenius loss (FL), defined in (4.6) and (4.7). When computing EL and FL, we use the rotation information criterion (RIC) (see Lysen, 2009) to select the optimal regularization parameter (and the associated optimal precision matrix). For selected combinations of N and T only, we also provide graphs with the ROC curves.
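The support-recovery statistics above (TPR, FPR and the F1 score) can be computed as in the following sketch; support_metrics is our own hypothetical helper, not code from the paper.

```python
import numpy as np

def support_metrics(theta_true, theta_hat, tol=1e-8):
    """True positive rate, false positive rate and F1 score for
    off-diagonal support recovery of an estimated precision matrix.
    (Our helper; the function name is hypothetical.)"""
    off = ~np.eye(theta_true.shape[0], dtype=bool)
    truth = np.abs(theta_true[off]) > tol        # true links
    est = np.abs(theta_hat[off]) > tol           # estimated links
    tp = np.sum(truth & est)                     # true positives
    fp = np.sum(~truth & est)                    # false positives
    fn = np.sum(truth & ~est)                    # false negatives
    tpr = tp / max(truth.sum(), 1)
    fpr = fp / max((~truth).sum(), 1)
    f1 = 2 * tp / max(2 * tp + fn + fp, 1)       # F1 = 2TP/(2TP + FN + FP)
    return tpr, fpr, f1

theta_true = np.array([[2.0, 0.5, 0.0],
                       [0.5, 2.0, 0.0],
                       [0.0, 0.0, 2.0]])
theta_hat = np.array([[2.1, 0.4, 0.1],
                      [0.4, 1.9, 0.0],
                      [0.1, 0.0, 2.0]])
tpr, fpr, f1 = support_metrics(theta_true, theta_hat)
print(tpr, fpr, round(f1, 3))
```

Tracing (tpr, fpr) pairs over a grid of ρ_G values gives the ROC curve, and the maximum of f1 over that grid gives the reported F1 score.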
As for β, we report the bias, root mean square error (RMSE), empirical size and power of the OLS estimator of β and of the feasible generalized least-squares (GLS) estimator implemented using Θ̂ as an estimate of Θ. In computing the empirical size, we set the nominal size to 5%, while in calculating the power we assume as an alternative hypothesis H_1: β = 0.95.

Results
The results are summarized in Tables 1-6 and Figures 1 and 2. Table 1 shows that, when the data have a block-wise dependence structure, our method greatly outperforms the conventional GLASSO for all combinations of N, T and G. In particular, the F1 score and the AUC show that the block-GLASSO has higher true positive rates and substantially lower false positive rates, while the EL and FL are always lower for the block-GLASSO, indicating that the latter provides a better estimation of the precision matrix.

Note:
In calculating the empirical size, the nominal size is set to 5%, while in calculating the power we assume as an alternative hypothesis H_1: β = 0.95. F1 is the F1 score, AUC is the area under the ROC, EL is the average entropy loss in (4.6) and FL is the average Frobenius loss in (4.7).

It is interesting to note that when T = 10 and G = N/2 the block-GLASSO does not perform well relative to the other cases, and its properties are much worse than in the case T = 10 and G = N/5. More generally, Tables 1 and 2 show that, for the same pair of N and T, the properties of the block-GLASSO deteriorate as G rises, thus confirming our theoretical result that, holding N and T fixed, the estimation error is higher when G is large or, equivalently, M is small. This result is also confirmed by Figure 1, which shows the ROC curves for the block-GLASSO for varying N, T and G. As expected, the performance of the estimator improves as N (and hence M) increases for fixed T and G, and as T increases for fixed N and G, while it deteriorates as G rises, holding N and T constant. Table 3 reports the small-sample properties of the OLS and GLS estimators as well as of the block-GLASSO. As expected in the case of cross-sectionally correlated regression errors, the OLS estimator, while having a bias comparable to that of the GLS, has a higher RMSE and is oversized for all combinations of N, T and G. Hence, ignoring the network leads to severe over-rejection of the null hypothesis. Looking at the GLS estimator, its empirical size is close to the nominal size of 5% in most cases, although some size distortions can be observed when T = 10 and G = N/2, namely for short panels characterized by the presence of many small groups. In fact, in this case the block-GLASSO does not perform well, having small F1 and AUC and large EL and FL, thus confirming the asymptotic results reported in Section 3. Similar results can be observed in Table 4 for the case where the dependent variable is generated by the first-order autoregressive model (4.3).
Under non-normal errors (Table 5), the block-GLASSO still performs well in detecting the network, as confirmed by F1 and AUC values similar to those reported in Table 1, although its EL and FL are much higher than in the normal counterpart. Table 6 shows results when the error covariance matrix displays general intra-block variation (see (4.4) and (4.5)); it reports the properties of the GLS estimators of β and of the flexible block-GLASSO applied to regression residuals in model (4.1)-(4.2) with general intra-block variation and β = 1. It is interesting to observe that the empirical size of the GLS estimator of β when ignoring the intra-block variation (block-GLASSO) is in some cases still close to the nominal value of 5%. The GLS estimator based on the more general procedure (flexible block-GLASSO) shows a good performance only for smaller values of G, perhaps because under small G (and hence large M) the covariance of ȳ_gt better approximates the part of the covariance that is block-wise. We also remark that the more flexible procedure is computationally much slower than the block-GLASSO. Figure 2 shows the ROC for the flexible block-GLASSO and the conventional GLASSO, as well as the group LASSO of Yuan and Lin (2006). The use of a group penalty in the group LASSO encourages the recovery of the block structure, although it does not impose it as the block-GLASSO does. Because the group LASSO was developed in the context of regression analysis, we apply it to our model as a neighbourhood selection problem for each node of the network. It is interesting to see from Figure 2 that the group LASSO performs less well than the block-GLASSO, but slightly better than the conventional GLASSO, as the latter does not use any a priori information about the blocks.

AN EMPIRICAL EXAMPLE: SPATIAL SPILLOVERS IN REGIONAL GROWTH AND CONVERGENCE IN EUROPE
We use the block-GLASSO for estimating a growth equation in per-capita gross value-added and for testing for economic convergence of European regions. The debate on whether there exists convergence in per-capita input and income across nations is still open, with results that differ depending on the sample period and the regions included, as well as on the estimation methods adopted. A number of authors have highlighted the importance of incorporating spatial effects when studying economic growth and regional convergence and have proposed the use of spatial econometric techniques; see, among others, Rey and Montouri (1999), Ertur and Koch (2007) and Cuaresma and Feldkircher (2013). Spatial dependence in regional economic growth is likely to arise from technology spillovers across neighbouring regions and from factor mobility, as well as from the presence of spatial heterogeneity (Rey and Montouri, 1999). If ignored, spatial dependence in economic growth data biases estimates of the speed of income convergence across geographical regions. We contribute to this literature by estimating a growth equation with spatial spillovers, using the block-GLASSO procedure to estimate the spatial weights matrix. We use data on gross value-added per worker (GVA) for 1,088 NUTS3 regions observed over the period 1980-2012 in 14 European countries. 3 The NUTS classification is a hierarchical system for dividing up the economic territory of the European Union (EU) for the purpose of socio-economic analysis of the regions and the design of EU regional policies. It subdivides the EU territory into regions at three different levels, NUTS1, NUTS2 and NUTS3, moving from larger to smaller geographical units.
Standard neo-classical growth models state that countries will converge to the same level of per-capita income in the long run, independently of initial conditions, as long as there are diminishing returns to capital and labour and perfect diffusion of technology. Under this framework, poorer countries and regions grow faster than richer ones, and a negative relationship between average growth rates and initial income levels is expected. Let y_{i,t+k} = ln(GVA_{i,t+k}/GVA_{it}) be the growth in per-capita GVA (expressed in euros at 2005 prices) for NUTS3 region i over a set of non-overlapping time intervals of length k. Our empirical model is the Gaussian CAR model for y_{it}:

E(y_{i,t+k} | y_{j,t+k}, j = 1, 2, . . . , N, j ≠ i) = α + β ln(GVA_{it}) + Σ_{j≠i} w_{ij} y_{j,t+k},   (5.1)

Var(y_{i,t+k} | y_{j,t+k}, j = 1, 2, . . . , N, j ≠ i) = σ_i^2,   (5.2)

where we set k = 3. Hence, a negative coefficient attached to the variable ln(GVA_{it}) indicates that NUTS3 regions with a low initial level of income grow faster than regions with higher initial levels of income, supporting the hypothesis of absolute convergence. The use of non-overlapping time intervals is common practice in the cross-country growth literature, as it decreases the influence of short-term shocks and business cycles on economic activity, while revealing long-run relationships. Compared to longer time intervals, the use of three-year non-overlapping intervals allows us to keep a sufficient number of observations to exploit the time dimension of the panel. Following existing studies on spatial interaction effects in regional economic growth models, the inclusion of the spatial lag of the dependent variable (the growth rate) amongst the regressors in (5.1) aims at capturing the effect of inter-regional flows of labour, capital and technology on growth and convergence; see Rey and Montouri (1999), Ertur and Koch (2007) and Cuaresma and Feldkircher (2013). In (5.1), w_{ij} is the (i, j)th element of an N × N matrix W, known as the spatial weights matrix, such that w_{ii} = 0.
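The construction of the dependent variable over non-overlapping intervals can be sketched as follows. The GVA levels here are simulated and the function name is illustrative, not from the paper.

```python
import numpy as np

def nonoverlapping_growth(gva, k=3):
    """gva: N x (T+1) array of GVA levels. Returns an N x (T//k) array of
    log growth rates over consecutive, non-overlapping k-period intervals:
    y_{i,t+k} = ln(GVA_{i,t+k} / GVA_{it}) for t = 0, k, 2k, ..."""
    T = gva.shape[1] - 1
    starts = np.arange(0, T - k + 1, k)           # interval start dates
    return np.log(gva[:, starts + k] / gva[:, starts])

# Simulated levels for 5 regions over 34 years (e.g. 1980-2013)
rng = np.random.default_rng(0)
gva = np.exp(rng.normal(10, 0.1, size=(5, 34)))
y = nonoverlapping_growth(gva, k=3)               # one growth rate per interval
```

With 33 annual transitions and k = 3, each region contributes 11 non-overlapping growth observations, which is the trade-off discussed above between smoothing out business cycles and retaining a usable time dimension.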
In spatial econometrics, W is often assumed to be known, using a priori information (e.g. from economic theory) on how statistical units potentially interact. Spatial weights based on geographical or travel distance, or on contiguity, have been used for modelling spatial spillovers in the economic growth equation, although this practice has been criticized as unrealistic (Cuaresma and Feldkircher, 2013). In this application, we keep W unknown and estimate it using our block-GLASSO approach. While the unit of analysis is the NUTS3 region, we take as groups larger geographical areas, given by 80 NUTS1 and then 211 NUTS2 European regions. Other grouping criteria may undoubtedly be suggested, for example by looking at the literature on club convergence – see, among others, Corrado et al. (2005) – or by using methods for identifying communities in social networks from the graph modelling literature (Freeman, 1979).
It is interesting to observe that (5.1) and (5.2) for the conditional distribution imply the joint normal distribution (Besag, 1974)

y_t ∼ N(μ_t, Σ),   Σ = (I_N − W)^{-1} Λ,

where Λ = diag(σ_1^2, σ_2^2, . . . , σ_N^2) and μ_t = α + β ln(GVA_t), provided that (I_N − W) is invertible and (I_N − W)^{-1} Λ is symmetric and positive definite. The reverse is also true: namely, if y_t ∼ N(μ_t, Σ), where Σ = Θ^{-1} is an N × N positive definite matrix, then (5.1) and (5.2) also hold, with w_{ij} = −θ_{ij}/θ_{ii} and

Var(y_{it} | y_{jt}, j = 1, 2, . . . , N, j ≠ i) = θ_{ii}^{-1};   (5.5)

see Mardia (1988) and Meinshausen and Buhlmann (2006). It follows that the problem of estimating w_{ij} in the CAR model (5.1)-(5.2) is equivalent to determining whether y_{it} and y_{jt} are conditionally independent, i.e. θ_{ij} = 0. Hence, in this application, we estimate W via Θ by imposing a block structure on Θ (and hence on Σ and W). Table 7 offers some descriptive statistics on the variable under study, at the NUTS3 level. It is interesting to observe that the region with the highest level of per-capita GVA (159,936 euros) is the London area, while the region with the lowest per-capita GVA (1,842 euros) is North Portugal, which is also the region with the highest growth in per-capita GVA (47.183%) over the three-year time interval. Table 8 reports estimates of growth equations (5.1) and (5.2). The first column provides OLS estimates ignoring the spatial structure of the data, while the second and third columns show GLS estimates where contemporaneous correlation is incorporated and estimated by block-GLASSO. The coefficient of the initial per-capita GVA of NUTS3 provinces is negative and significant, showing the presence of (absolute) convergence in all regressions. However, when adopting the GLS approach based on the block-GLASSO procedure, the coefficient is smaller, leading to a lower speed of convergence towards the steady state and a longer time necessary for the regional economies to cover half of the initial lag from their steady states, when compared to traditional OLS estimation.
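The correspondence between the CAR spatial weights and the precision matrix can be verified numerically. The sketch below, with an illustrative precision matrix, recovers w_ij = −θ_ij/θ_ii and the conditional variances 1/θ_ii, and checks that diag(θ_ii)(I_N − W) reproduces Θ, so a zero partial correlation (θ_ij = 0) corresponds exactly to a missing link (w_ij = 0).

```python
import numpy as np

def car_from_precision(theta):
    """Recover CAR spatial weights and conditional variances from a
    positive definite precision matrix: w_ij = -theta_ij / theta_ii,
    w_ii = 0, and Var(y_i | rest) = 1 / theta_ii."""
    d = np.diag(theta)                  # theta_ii
    W = -theta / d[:, None]             # divide row i by theta_ii
    np.fill_diagonal(W, 0.0)            # w_ii = 0 by convention
    cond_var = 1.0 / d
    return W, cond_var

# Illustrative tridiagonal precision matrix: node 0 and node 2 are
# conditionally independent given node 1 (theta_02 = 0).
theta = np.array([[ 4.0, -1.0,  0.0],
                  [-1.0,  4.0, -1.0],
                  [ 0.0, -1.0,  4.0]])
W, v = car_from_precision(theta)
```

The zero in position (0, 2) of Θ carries over to W, so the CAR graph and the conditional-independence graph coincide, which is the equivalence exploited when estimating W via Θ.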
Goodness of fit for all regressions is low, ranging between 12% and 13%, indicating that some important factors have not been included in the models.
The lower panel of the table reports the percentage of links, the average path length and a set of centrality measures proposed by graph theory – see Borgatti and Everett (2006) and Freeman (1979) – that are widely used to characterize the compactness of graphs. The average path length is the average length of all the shortest paths from or to the vertices in the network, giving an indication of how dense the network is. The graph-level centrality measures are based on three node-level centrality indicators (degree, closeness and betweenness), which characterize different aspects of the relative importance of each node and are commonly used in the applied literature. All graph-level measures vary between zero and one, and assume their highest value when the graph has a star or wheel shape. Looking at the percentage of links, it emerges that, as expected, the estimated networks are quite dense and connected when using either NUTS1 or NUTS2 as blocks. This is confirmed by the average path length, which is very low, at around 1.6-1.8. However, the graph centrality measures are close to zero, indicating that no single region dominates all the others. This is also evident from Figure 3, which shows the adjacency graph resulting from the estimation of model (5.1)-(5.2) via block-GLASSO with NUTS1 regions taken as blocks. We do not report the graph when using NUTS2 regions as blocks, because there are too many of them to display clearly. It is interesting to observe that the most connected NUTS1 areas are also the regions with the highest per-capita GVA, namely Greater London, Norway and South Netherlands, while the areas with the fewest connections are Northern Ireland and the northern areas of the United Kingdom, which are also geographically isolated from the other regions. Also, in most cases, regions from the same country are connected, thus supporting previous studies using geographical contiguity or geographical distance as a metric of distance.
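The graph summaries reported in the lower panel can be sketched with plain Python on an adjacency list. The example below computes the density (percentage of links), the average shortest-path length via breadth-first search, and Freeman's degree centralization, which equals one for a star graph and zero for a regular graph; the example graph is illustrative, not an estimated network from the paper.

```python
from collections import deque

def graph_summaries(adj):
    """adj: dict mapping node -> set of neighbours (undirected, connected).
    Returns (density, average path length, degree centralization)."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2   # each edge counted twice
    density = m / (n * (n - 1) / 2)
    # Average shortest-path length: BFS from every node
    total, pairs = 0, 0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += n - 1
    apl = total / pairs
    # Freeman's degree centralization: deviation of degrees from the maximum,
    # normalized so a star graph scores 1
    degs = [len(adj[u]) for u in adj]
    dmax = max(degs)
    centralization = sum(dmax - d for d in degs) / ((n - 1) * (n - 2))
    return density, apl, centralization

# A star on 6 nodes: one hub connected to 5 leaves
star = {0: {1, 2, 3, 4, 5}, 1: {0}, 2: {0}, 3: {0}, 4: {0}, 5: {0}}
d, apl, c = graph_summaries(star)
```

For the star, centralization is exactly one, consistent with the remark that the graph-level measures attain their maximum on star-shaped networks; the estimated regional networks, by contrast, score near zero on these measures.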

CONCLUDING REMARKS
In the last few years, several methods have been proposed for reducing the dimensionality problem when estimating graphical models. These methods usually exploit a priori information on possible independence between groups of observations. In this paper, we focus on the estimation of a Gaussian graphical model with a large number of variables, where dependence between variables is block-wise because of, for example, a hierarchical or group membership structure. We propose an estimation strategy based on the GLASSO methodology applied to group averages of observations, and we derive the large-sample properties of the proposed estimator. Our Monte Carlo experiments show that our proposed estimator greatly outperforms the conventional GLASSO when data have block-wise dependence. These experiments also show that our procedure is robust to various deviations from block-wise dependence. For example, the method still delivers valid inference when there is some within-group variation, or under non-normal errors. We have shown the usefulness of this procedure on an empirical study of economic convergence of European regions, showing that accounting for block-wise dependence helps us to better estimate convergence parameters. Although there are many examples in economics where the membership is given, in many others this is not true, making the assumption that the block structure is known a priori too restrictive. One interesting extension of this work would be to determine endogenously the inclusion of a unit in a group as well as the size and number of the groups, following the work by Lin and Ng (2012), Bonhomme and Manresa (2015) and Ando and Bai (2016). Future work should also consider a block-wise structure for the covariance matrix of a VAR model, within the setting proposed by Barigozzi and Brownlees (2016) and Abegaz and Wit (2013). 
Finally, while our approach does not allow us to estimate the covariance matrix arising from one or more common pervasive factors, it would be interesting to study the properties of an estimation procedure that first controls for common pervasive factors and then estimates the network structure using de-factored residuals.

LOG-LIKELIHOOD FUNCTION FOR NETWORKS WITH BLOCK-WISE DEPENDENCE STRUCTURE
Let S be the N × N sample covariance matrix based on a sample of size T from the random vector y. To obtain the log-likelihood function, we first compute simplified expressions for ln|Θ| and Tr(SΘ), with Θ = Σ^{-1} and Σ given by expression (2.4). Using results in Magnus (1982),
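The two log-likelihood ingredients can be checked numerically. The sketch below, with an illustrative block-diagonal precision matrix (not the paper's structure (2.4)), computes ln|Θ| and Tr(SΘ), and verifies the standard simplification that for a block-diagonal Θ the log-determinant is the sum of the blocks' log-determinants.

```python
import numpy as np

def loglik_terms(S, theta):
    """Return (ln|theta|, Tr(S theta)) for a positive definite theta,
    using slogdet for numerical stability."""
    sign, logdet = np.linalg.slogdet(theta)
    assert sign > 0, "theta must be positive definite"
    return logdet, np.trace(S @ theta)

# Illustrative block-diagonal precision matrix with two blocks
block_a = np.array([[2.0, 0.5],
                    [0.5, 2.0]])
block_b = np.array([[3.0]])
theta = np.block([[block_a, np.zeros((2, 1))],
                  [np.zeros((1, 2)), block_b]])
S = np.eye(3)                                  # toy sample covariance

logdet, tr = loglik_terms(S, theta)
# For block-diagonal theta: ln|theta| = ln|block_a| + ln|block_b|
```

Such decompositions are what make the block-wise structure computationally attractive: the determinant and trace terms split across blocks, so the likelihood never requires operations on the full N × N matrix.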