On sparsity scales and covariance matrix transformations

We develop a theory of covariance and concentration matrix estimation on any given or estimated sparsity scale when the matrix dimension is larger than the sample size. Nonstandard sparsity scales are justified when such matrices are nuisance parameters, distinct from interest parameters, which should always have a direct subject-matter interpretation. The matrix logarithmic and inverse scales are studied as special cases, with the corollary that a constrained optimization-based approach is unnecessary for estimating a sparse concentration matrix. It is shown through simulations that for large unstructured covariance matrices, there can be appreciable advantages to estimating a sparse approximation to the log-transformed covariance matrix and converting the conclusions back to the scale of interest.


Introduction
The covariance matrix $\Sigma^*$ and concentration matrix $\Omega^* = (\Sigma^*)^{-1}$ of a $p$-dimensional random vector $X$ play a central role in summarizing linear relationships between variables and appear throughout multivariate analysis (e.g., Muirhead, 1982). When $p$ is small relative to the sample size $n$, the sample covariance matrix $S_n$ and its inverse are useful estimators of $\Sigma^*$ and $\Omega^*$, and their statistical properties are well understood. However, some modern applications yield data with $p \geqslant n$. Since the performance of statistical procedures is often quantified by their behaviour as $n$ grows large, allowing $p \geqslant n$ entails a double asymptotic regime in which $p$ grows at least as fast as $n$. While $S_n$ remains elementwise consistent as $p$ and $n$ grow, $S_n^{-1}$ does not exist, and noise accumulation leads to inconsistency in the stronger matrix norms relevant for applications.
For any $p \times p$ matrices $A$ and $B$ with eigenvalues $\lambda_k(A)$ and $\lambda_k(B)$ $(k = 1, \ldots, p)$, Weyl's inequality implies that $|\lambda_k(A) - \lambda_k(B)| \leqslant \|A - B\|_{\mathrm{op}}$, where $\|A\|_{\mathrm{op}} = \sup_{\|x\|_2 \leqslant 1} \|Ax\|_2$ is the spectral norm. For eigenvectors $\xi_k(A)$ and $\xi_k(B)$, an application of the so-called $\sin\theta$ theorem of Davis & Kahan (1970) bounds $\|\xi_k(A) - \xi_k(B)\|_2$ in terms of $\|A - B\|_{\mathrm{op}}$ and the spacings between adjacent eigenvalues. Thus, under conditions on the eigenvalue spacings, convergence in the spectral norm is sufficient for convergence of eigenvectors and eigenvalues. Implications of spectral norm consistency are therefore central to what follows. The sparsity-inducing transformations considered act on the eigenvalues through scalar maps $\lambda \mapsto \lambda^{(a,b)}$, with special cases $\lim_{b \to 0} \lambda^{(1,b)} = \lambda$, $\lim_{b \to 0} \lambda^{(-1,b)} = \lambda^{-1}$ and $\lambda^{(0,1)} = \log\lambda$. These generate the matrix transformations $\Sigma^* \mapsto \Sigma^*$, $\Sigma^* \mapsto (\Sigma^*)^{-1}$ and $\Sigma^* \mapsto \log(\Sigma^*)$, where the latter coincides (e.g., Battey, 2017, proof of Lemma 1.1) with the definition of $L^* = \log(\Sigma^*)$ through a Taylor series expansion of the matrix exponential, $\Sigma^* = \exp(L^*) = \sum_{k=0}^{\infty} (L^*)^k / k!$. Ideally, one would estimate the sparsity scale and, in principle, this is feasible using (1) or similar and a likelihood-based criterion. Framing this as a convex optimization problem seems challenging, but in the present paper we assume that a suitable, ideally optimal, sparsity scale has been determined and establish the estimation theory on that basis, with focus on the matrix logarithm in numerical examples.
The following notation is used. The eigenvalues are ordered from largest to smallest so that, for example, the maximum and minimum eigenvalues of a matrix $A$ are $\lambda_1(A)$ and $\lambda_p(A)$. We simplify this notation for specific widely used matrices; for instance, the eigenvalues of $\Sigma^*$ are written $\lambda_1^*, \ldots, \lambda_p^*$. The Frobenius norm and elementwise maximum norm of $A$ are written $\|A\|_F$ and $\|A\|_{\max}$, and are defined through $\|A\|_F^2 = \mathrm{trace}(A^{\mathrm{T}} A) = \sum_{u,v} |a_{uv}|^2$ and $\|A\|_{\max} = \max_{u,v} |a_{uv}|$. If the matrix $A$ is symmetric, the spectral norm simplifies to $\|A\|_{\mathrm{op}} = |\lambda_1(A)| \vee |\lambda_p(A)|$. For a vector $x = (x_1, \ldots, x_p)^{\mathrm{T}}$, the $\ell_0$-norm of $x$ is $\|x\|_0 = \sum_{v=1}^p \mathbb{1}(x_v \neq 0)$, where $\mathbb{1}(E)$ is the indicator of the event $E$, taking value 1 if $E$ is true and 0 otherwise. The notation $A \succ 0$ and $A \succeq 0$ means, respectively, that $A$ is strictly positive definite and positive semidefinite. The canonical basis vectors for $\mathbb{R}^p$ are written $e_1, \ldots, e_p$. For a positive integer $p$, the set $\{1, \ldots, p\}$ is written $[p]$. Finally, $\asymp$ means equality in order of magnitude, and $a \vee b$ and $a \wedge b$ denote the maximum and minimum of scalars $a$ and $b$.
Proofs are given in the Supplementary Material, except when their inclusion in the main article brings appreciable insight. To avoid unnecessary technical details, we assume that $\Sigma^*$ is the covariance matrix of a $p \times 1$ random vector $X$ of zero mean.

Estimation under sparsity on a transformed scale
General discussion

Assume that the function $f : \mathcal{V}_p^+ \to \mathcal{V}_p$ of § 1 is known or has been estimated. Then $f(\Sigma^*) = \Xi^* f(\Lambda^*) (\Xi^*)^{\mathrm{T}}$, where $\Lambda^*$ is the diagonal matrix of ordered eigenvalues of $\Sigma^*$ and $\Xi^* = (\xi_1^*, \ldots, \xi_p^*)$ is the matrix of corresponding orthonormal eigenvectors. The matrix-valued function $f$ is thus defined by the same scalar-valued function applied to the eigenvalues elementwise and yields symmetric matrices by construction.
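As a concrete illustration of this definition, the following minimal Python sketch (not from the paper; numpy assumed) applies a scalar function to the eigenvalues and reassembles the matrix. Section 4 notes that in high dimensions the matrix logarithm and exponential are better computed by Padé-type routines, so the spectral form below is for small illustrative matrices only.

```python
import numpy as np

def matrix_function(sigma, f):
    # Spectral definition: f(Sigma) = Xi f(Lambda) Xi^T for symmetric Sigma.
    eigvals, eigvecs = np.linalg.eigh(sigma)
    return (eigvecs * f(eigvals)) @ eigvecs.T

# Toy check that exp(log Sigma) recovers Sigma for a small covariance matrix.
sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
log_sigma = matrix_function(sigma, np.log)
assert np.allclose(matrix_function(log_sigma, np.exp), sigma)
```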
Write $F^* = f(\Sigma^*)$. A simple estimator of a sparse $F^*$ is the thresholding estimator $\tilde F$, which sets to zero all entries of a pilot estimator $\hat F^{\mathrm{P}}$ of $F^*$ that are below a predefined threshold $\nu$ in absolute value. This procedure depends for its success on a skilful choice of pilot estimator. In particular, $\hat F^{\mathrm{P}}$ must converge elementwise to $F^*$, as we show in § 2.3. Thus there are two main challenges associated with estimation under sparsity on a transformed scale: given a positive-definite pilot estimator $\hat\Sigma^+$ of $\Sigma^*$, under what conditions and at what rate will $\hat F^{\mathrm{P}} = f(\hat\Sigma^+)$ converge to $F^*$ elementwise? And given a spectral norm-consistent ($n, p \to \infty$) estimator $\tilde F$ of $F^*$, under what conditions and at what rate will $\tilde\Sigma = f^{-1}(\tilde F)$ converge to $\Sigma^* = f^{-1}(F^*)$ in the spectral norm?

Elementwise perturbation theory for matrix transformations

Theorem 1 answers the first of the two questions from the previous paragraph in general terms. The result hinges on the following condition.
Condition 1. As $p \to \infty$, the sequence of minimum eigenvalues of the corresponding matrices $\Sigma^*$ is bounded away from zero.
In the case of a diagonal covariance matrix, we require $\Sigma^*$ to have minimum diagonal entry bounded away from zero as $p \to \infty$, and Condition 1 can be viewed as a generalization of that requirement. A lower bound on $\lambda_p^*$ is $\{\sum_{k=1}^p (\lambda_k^*)^{-2}\}^{-1/2} = \{\mathrm{tr}(\Omega^{*2})\}^{-1/2}$, so a sufficient condition for Condition 1 is that the sum of the diagonal entries of the squared concentration matrix be bounded, although Condition 1 can hold even if this sum diverges with $p$.
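The bound is easy to check numerically; the snippet below (an illustration only, using numpy) verifies $\lambda_p(\Sigma) \geqslant \{\mathrm{tr}(\Omega^2)\}^{-1/2}$ on a randomly generated positive-definite matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 50
a = rng.standard_normal((p, p))
sigma = a @ a.T + np.eye(p)                 # positive-definite test matrix
omega = np.linalg.inv(sigma)                # concentration matrix
lower_bound = np.trace(omega @ omega) ** (-0.5)
assert np.linalg.eigvalsh(sigma).min() >= lower_bound
```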
Theorem 1. Suppose that Condition 1 holds. Let $\hat\Sigma^+$ be a symmetric matrix with ordered eigenvalues $\hat\lambda_1, \ldots, \hat\lambda_p$ where, as $p \to \infty$, the sequence of smallest eigenvalues $\hat\lambda_p$ is bounded away from zero. Let $\hat F^{\mathrm{P}} = f(\hat\Sigma^+)$. Then $\hat F^{\mathrm{P}}$ is elementwise consistent ($n, p \to \infty$) for $F^*$ if $\hat\Sigma^+$ is elementwise consistent for $\Sigma^*$ as $n, p \to \infty$ and the sum of residues appearing in (2) is bounded in absolute value for all $1 \leqslant k, v \leqslant p$ as $n, p \to \infty$, where, for a function $g$ of a complex variable, $\mathrm{Res}(g, a)$ denotes the residue of $g$ at $a$. If this condition is satisfied, then $\|\hat F^{\mathrm{P}} - F^*\|_{\max}$ is of the same order as $\|\hat\Sigma^+ - \Sigma^*\|_{\max}$. This conclusion is true even when $\hat\Sigma^+$ is not elementwise consistent as $n, p \to \infty$, but then $\hat F^{\mathrm{P}}$ is not elementwise consistent either. Rates of convergence of elementwise-consistent estimators of $\Sigma^*$ are discussed in § 2.4. Some important special cases of Theorem 1 are discussed in less abstract terminology after the proof. We henceforth write the sum of the residues from (2) as $R_{k,v}$.
Proof. A function $f$ of a $p \times p$ matrix $A$ satisfies (Kato, 1976, p. 44)
$$f(A) = \frac{1}{2\pi i} \oint_{\gamma_A} f(z)\,(zI - A)^{-1}\,\mathrm{d}z,$$
where $I$ is the identity matrix and $\gamma_A$ is a simple closed curve lying in the region of analyticity of $f$ and enclosing all the eigenvalues of $A$ in its interior. Then, in an obvious notation, the difference $f(\hat\Sigma^+) - f(\Sigma^*)$ has an analogous contour-integral representation, and the deformation theorem (Whittaker & Watson, 1920, § 5.2, Corollary 2) ensures that, for $\hat\gamma$ and $\gamma^*$ lying wholly within a contour $\gamma$ that does not cross the imaginary axis, both integrals may be taken over $\gamma$. Writing a positive-definite matrix $A$ as $QDQ^{\mathrm{T}}$, where $D$ is $A$'s diagonal matrix of eigenvalues and $Q$ is the matrix of corresponding orthonormal eigenvectors, the elementwise difference is expressible as the quadruple sum (3) over eigenvector entries, whose inner contour integrals are evaluated, by the theory of residues (Whittaker & Watson, 1920, p. 112), as the residue sums (4). The orthonormality of the eigenvectors ensures that their entries are of order $1/p$ on average. Thus the quadruple sum in (3) causes no accumulation of error as $p$ diverges, and the convergence rate of $f(\hat\Sigma^+)$ to $f(\Sigma^*)$ is that of $\hat\Sigma^+$ to $\Sigma^*$, provided that the $R_{k,v}$ are bounded in absolute value for all $1 \leqslant k, v \leqslant p$ as $n, p \to \infty$.
Corollary 1. Let $\hat\Sigma^+$ be as in Theorem 1, and let $\hat L^{\mathrm{P}} = \log(\hat\Sigma^+)$. Under Condition 1, $\hat L^{\mathrm{P}}$ is elementwise consistent for $L^*$, and $\|\hat L^{\mathrm{P}} - L^*\|_{\max}$ is of the same order as $\|\hat\Sigma^+ - \Sigma^*\|_{\max}$.

Proof. By direct calculation,
$$\frac{1}{2\pi i}\oint_{\gamma} \frac{\log z}{(z - \hat\lambda_k)(z - \lambda^*_v)}\,\mathrm{d}z = \frac{\log\hat\lambda_k - \log\lambda^*_v}{\hat\lambda_k - \lambda^*_v},$$
which is unbounded only in the limit of $\lambda^*_v$ or $\hat\lambda_k$ tending to zero, and this is precluded by the conditions of Corollary 1. The limit as $\hat\lambda_k \to \lambda^*_v$ is $1/\lambda^*_v$, which is bounded under Condition 1. The remarks about convergence rates immediately following Theorem 1 apply.

The special case of the matrix logarithm makes it clear that the conclusion of, and proof strategy for, Theorem 1 are far from evident a priori in the $p \geqslant n$ setting. While the simplest proof of Corollary 1 for $p < n$ would show consistency of the eigenvectors and log eigenvalues of $\hat\Sigma^+$ for those of $\Sigma^*$ as $n \to \infty$, this is obviously unsuitable when $p > n$, by the discussion of § 1. Since we are only interested in a neighbourhood of the spectra of $\Sigma^*$ and $\hat\Sigma^+$, we can select for $\log z$ in the proof of Corollary 1 any branch of the complex logarithm and have no difficulties due to its multivaluedness.
Remark 1. By a different strategy, Birman & Solomyak (2003, Theorem 8.1) proved that, for general self-adjoint operators $A$ and $B$ and for functions $f$ that are uniformly Lipschitz on $\mathbb{R}$, $f(A) - f(B)$ admits the double operator integral representation (5), with $E_A$ and $F_B$ being the spectral measures of $A$ and $B$; the Lipschitz requirement rules out the case of the logarithm and the inverse. With discrete spectral measures, (5) is similar to (3) with its contour integral replaced by the right-hand side of (4).
Another important example of $F^*$ is the matrix inverse of $\Sigma^*$. For the interpretation of the elements of a concentration matrix, see Cox & Wermuth (1996, pp. 68–9).
The implication of the corresponding result for the matrix inverse is that a constrained optimization-based approach (e.g., Cai et al., 2016) is unnecessary for estimating a sparse matrix inverse: consistency as $n, p \to \infty$ is achieved by thresholding $\hat\Omega^{\mathrm{P}}$, as shown in Corollary 3 below. This has appreciable computational benefits.

Consistent estimation of covariance and concentration matrices in the spectral norm
As discussed in § 1, success in many applications depends on covariance or concentration matrix estimators that are consistent (n, p → ∞) in the spectral norm. Such consistency is achievable under sparsity. Accordingly, we assume that a transformation f has been found for which the following holds.
Condition 2. The matrix $F^*$ belongs to the class of approximately sparse matrices whose rows lie in an $\ell_q$ ball of radius governed by $s(p)$, where $C$ is an arbitrary real constant, $0 \leqslant q < 1$, and $s(p)/p \to 0$.
Define a thresholding estimator of $F^*$ as $\tilde F = (\tilde F_{uv})$, where
$$\tilde F_{uv} = T_\nu(\hat F^{\mathrm{P}}_{uv}) = \hat F^{\mathrm{P}}_{uv}\,\mathbb{1}(|\hat F^{\mathrm{P}}_{uv}| \geqslant \nu), \qquad (6)$$
for a pilot estimator $\hat F^{\mathrm{P}}$ and an appropriate choice of threshold $\nu$ to be discussed below. The next result follows immediately from Theorem 1 and the literature on thresholded estimators (e.g., Bickel & Levina, 2008, proof of Theorem 1).
Corollary 3. Let $\hat\Sigma^+$ be as in Theorem 1 and such that $\|\hat\Sigma^+ - \Sigma^*\|_{\max} = O_{\mathrm{pr}}(r_{n,p})$, where $r_{n,p}$ tends to zero as $n, p \to \infty$. Assume Conditions 1 and 2 and take $\nu \asymp r_{n,p}$ in (6). Then $\tilde F$ is consistent ($n, p \to \infty$) for $F^*$ in the spectral norm, at a rate determined by $s(p)$ and $r_{n,p}$.
Remark 2. The requirement that $r_{n,p} \to 0$ as $n, p \to \infty$ usually implies the scaling condition $(\log p)/n \to 0$. Further details follow Corollary 4.
While Corollary 3 describes the rate at which the tuning parameter may decay as $p$ and $n$ increase, in small samples the constant $c$ in this tuning parameter is relevant and asymptotic results provide no guidance on how to set it. All such sparsity-inducing procedures suffer this limitation. The simplest strategy is to choose a reasonable arbitrary constant such as $c = 1$, so that, for the pilot estimators considered in § 2.4, one would set $r_{n,p} = \{(\log p)/n\}^{1/2}$ under appropriate tail conditions, although one may instead set $c$ using a measure of out-of-sample predictive performance. The construction of $\hat\Sigma^+$ given in Proposition 2 makes calculation of $\tilde F$ almost instantaneous for $p \leqslant 1000$, facilitating crossvalidation. It is reasonable in practice, although statistically suboptimal, to decide on a level $\alpha$ of sparsity that one is willing to impose and set the threshold so that only the proportion $\alpha$ of entries that are largest in absolute value are kept. This was done in most of the experiments of § 4. One might instead inspect a plot of the number of matrix entries above $\nu$ against $\nu$, and choose a value of $\nu$ at which a sharp decrease in negative gradient, an elbow, is observed, if any.
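The threshold choices just discussed are simple to implement; the sketch below (illustrative Python, not the paper's code) supports both the universal choice $\nu = c\{(\log p)/n\}^{1/2}$ and the quantile-based rule that keeps only the proportion $\alpha$ of entries largest in absolute value.

```python
import numpy as np

def hard_threshold(f_pilot, nu):
    # Set entries below nu in absolute value to zero.
    return np.where(np.abs(f_pilot) >= nu, f_pilot, 0.0)

def universal_threshold(n, p, c=1.0):
    # nu = c {(log p)/n}^{1/2}
    return c * np.sqrt(np.log(p) / n)

def quantile_threshold(f_pilot, alpha=0.05):
    # Keep only the proportion alpha of entries largest in absolute value.
    nu = np.quantile(np.abs(f_pilot), 1.0 - alpha)
    return hard_threshold(f_pilot, nu)
```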
Given a spectral norm-consistent estimator $\tilde F$ of $F^*$ as $n, p \to \infty$, the remaining challenge is to establish conditions under which $f^{-1}(\tilde F)$ converges to $f^{-1}(F^*)$ in the spectral norm, with associated convergence rates. Under smoothness conditions on $g = f^{-1}$, and for symmetric matrices $A$ and $B$, general bounds of the form
$$\|g(A) - g(B)\|_{\mathrm{op}} \leqslant c_g \|g'\|_{\infty} \|A - B\|_{\mathrm{op}} \qquad (7)$$
are available. In (7), $c_g$ is a constant, $g'$ is the derivative of $g$, and $\|g'\|_{\infty} = \sup_x |g'(x)|$, where the matrix-valued function is defined as the same scalar-valued function applied to the eigenvalues. See Bhatia (1997, p. 320) for references and Bhatia (1997, § X) for other bounds under special conditions on $g$, $A$ and $B$. Equation (7) is, however, useless for unbounded functions, which are best studied on a case-by-case basis as in Theorem 2.
Theorem 2. Assume Conditions 1 and 2 and let $\tilde\Sigma = \exp(\tilde L)$ and $\tilde\Omega = \exp(-\tilde L)$, where $\tilde L$ is the thresholded pilot estimator of $L^* = \log(\Sigma^*)$, i.e., $\tilde L = \{T_\nu(\hat L^{\mathrm{P}}_{uv})\}$, where $\hat L^{\mathrm{P}} = \log(\hat\Sigma^+)$ and $\nu$ and $\hat\Sigma^+$ are as in Corollary 3. Then, under the conditions of Corollary 3, the spectral norm errors of $\tilde\Sigma$ and $\tilde\Omega$ are controlled by that of $\tilde L$ and by the extreme eigenvalues of $L^*$. Thus, if the sequences of maximum and minimum eigenvalues of $L^*$ are bounded as $p \to \infty$, then $\tilde\Sigma$ and $\tilde\Omega$ are consistent for $\Sigma^*$ and $\Omega^*$ in the spectral norm as $n, p \to \infty$. They are also positive definite by construction.
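A minimal sketch of the estimators in Theorem 2, assuming scipy is available for the matrix logarithm and exponential (the paper's numerical work uses Matlab's logm and expm); thresholding is the hard-thresholding rule $T_\nu$ of (6).

```python
import numpy as np
from scipy.linalg import expm, logm

def log_domain_estimators(sigma_pilot, nu):
    # L^P = log(Sigma^+); take the real part to guard against round-off.
    l_pilot = np.real(logm(sigma_pilot))
    l_pilot = (l_pilot + l_pilot.T) / 2                        # enforce symmetry
    l_tilde = np.where(np.abs(l_pilot) >= nu, l_pilot, 0.0)    # hard thresholding
    sigma_tilde = expm(l_tilde)                                # estimator of Sigma*
    omega_tilde = expm(-l_tilde)                               # estimator of Omega*; no matrix inversion
    return sigma_tilde, omega_tilde
```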

Pilot estimators for $\Sigma^*$
We have not yet discussed the construction of elementwise-consistent pilot estimators $\hat\Sigma^+$ ($n, p \to \infty$) whose minimum eigenvalues are bounded away from zero, which is crucial for application of Theorem 1. A natural choice is the projection, in elementwise loss, of an unconstrained estimator $\hat\Sigma^{\mathrm{P}}$ onto the cone $\{\Sigma : \Sigma - \delta I_p \succeq 0\}$. This projection can be rewritten as the semidefinite programming problem
$$\min_{\vartheta > 0,\ \Sigma - \delta I_p \succeq 0} \vartheta \quad \text{subject to} \quad \|\Sigma - \hat\Sigma^{\mathrm{P}}\|_{\max} \leqslant \vartheta,$$
and solved using the cvx solver in Matlab. The projection preserves the elementwise convergence rate of $\hat\Sigma^{\mathrm{P}}$, as shown in Proposition 1.
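The same projection can be written, for instance, in cvxpy, a Python analogue of the Matlab cvx solver mentioned above (an illustrative sketch, requiring a semidefinite solver such as SCS):

```python
import cvxpy as cp
import numpy as np

def project_elementwise(sigma_pilot, delta):
    # Projection of sigma_pilot, in elementwise loss, onto {Sigma : Sigma - delta I >= 0}.
    p = sigma_pilot.shape[0]
    sigma = cp.Variable((p, p), symmetric=True)
    vartheta = cp.Variable(nonneg=True)
    constraints = [sigma - delta * np.eye(p) >> 0,
                   cp.max(cp.abs(sigma - sigma_pilot)) <= vartheta]
    cp.Problem(cp.Minimize(vartheta), constraints).solve()
    return sigma.value
```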
In the case of the matrix logarithm, the largest contribution to the finite-sample elementwise error in estimating $L^*$ by $\hat L^{\mathrm{P}}$ is seen from (3) and (4) to grow as $\delta$ decreases. Thus, subject to the constraint $\delta \leqslant \lambda_p^*$, which ensures that $\|\hat\Sigma^+ - \Sigma^*\|_{\max} \leqslant \|\hat\Sigma^{\mathrm{P}} - \Sigma^*\|_{\max}$, the finite-sample error is minimized in the limit as $\delta \to \lambda_p^*$. In the absence of information about $\lambda_p^*$, we are forced to choose a small arbitrary constant. Another construction of an elementwise-consistent $\hat\Sigma^+$ is given in Proposition 2.
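Proposition 2 itself is not reproduced here; the sketch below implements the shrinkage form written out explicitly in § 4, $\hat\Sigma^+ = (1 - \delta_{n,p}) S_n + \delta_{n,p}\,\mathrm{diag}(S_n)$ with $\delta_{n,p} = \{(\log p)/n\}^{1/2}$ (an illustrative Python version; the data are assumed to have mean zero, as in § 1).

```python
import numpy as np

def shrinkage_pilot(x):
    # x: n-by-p data matrix, rows are independent observations with mean zero.
    n, p = x.shape
    s_n = x.T @ x / n                               # sample covariance under zero mean
    delta = np.sqrt(np.log(p) / n)                  # delta_{n,p} = {(log p)/n}^{1/2}
    return (1.0 - delta) * s_n + delta * np.diag(np.diag(s_n))
```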
This $\hat\Sigma^+$ is a biased estimator of $\Sigma^*$ in finite samples but asymptotically unbiased by the requirement on $\delta_{n,p}$ as $n, p \to \infty$. The following corollary of Weyl's inequality elucidates the effect of $\delta_{n,p}$.
The following section describes the implications of the previous results for high-dimensional linear discriminant analysis and linear regression, in which a covariance or concentration matrix is a nuisance parameter. These examples are used in § 4 to assess empirical performance.

Discriminant analysis
One frequently encountered high-dimensional data type involves binary labelled data, $y_i \in \{-1, 1\}$, alongside high-dimensional covariate information $x_i$. Discriminant analysis aims to find a function $D(x)$ of the features that achieves maximal separability between the two classes $\mathcal{C}_+ = \{i : y_i = 1\}$ and $\mathcal{C}_- = \{i : y_i = -1\}$. One way of quantifying separability is the classification risk $\mathrm{cr}(D)$, which depends on $\kappa = \mathrm{pr}(Y = 1)$ and the conditional distributions $P_X^{\pm}$ of $X$ given its class label, $+$ or $-$. Fisher (1936) suggested quantifying separability by the ratio $\mathrm{rf}(D)$ of between-class variability to within-class variability. Thus maximization of $\mathrm{rf}(D)$ with respect to $D$ corresponds to finding the discriminant function such that the proportion of variability explained by the class labels is as large as possible. Although minimization of $\mathrm{cr}(D)$ requires knowledge of the conditional distributions $P_X^{\pm}$, maximization of $\mathrm{rf}(D)$ only requires knowledge, or estimation, of moments.
The Supplementary Material gives an adaptation of the Fisher (1936) procedure suitable when $p \geqslant n$ and establishes its theoretical properties when a spectral norm-consistent ($n, p \to \infty$) estimator of the relevant covariance or concentration matrix is available. The procedure relies on convex optimization; a simpler strategy exploits the connection to linear regression, indicated by Fisher (1936) and discussed in § 4.3.

Linear regression analysis
Regression coefficients express the dependence of a response variable $Y$ on explanatory variables $X_1, \ldots, X_p$ and, in linear relationships, have the simple interpretation of giving the change in $E(Y)$ for a hypothetical unit increase in $X_s$, say, with the values of the other variables held fixed. Linear regression coefficients are of the form $\beta^* = (\Sigma^*_{YX} \Omega^*_{XX})^{\mathrm{T}}$, where $\Sigma^*_{YX}$ is the row vector of elements $\mathrm{cov}(Y, X_s)$ and $\Omega^*_{XX} = (\Sigma^*_{XX})^{-1}$, with $\Sigma^*_{XX}$ being the matrix of covariances between $X_1, \ldots, X_p$.
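Anticipating Theorem 3 below, a plug-in estimate of these coefficients can be formed from any estimates of the covariance and inverse-covariance blocks; a minimal Python sketch follows (helper names are illustrative, not from the paper).

```python
import numpy as np

def plugin_regression_coefficients(sigma_hat, omega_xx_hat):
    # sigma_hat: (p+1)-by-(p+1) estimated covariance of (Y, X_1, ..., X_p),
    # with Y indexed first; omega_xx_hat: p-by-p estimate of Omega*_XX.
    sigma_yx = sigma_hat[0, 1:]              # estimated cov(Y, X_s), s = 1, ..., p
    # (Sigma_YX Omega_XX)^T equals Omega_XX Sigma_YX^T since Omega_XX is symmetric.
    return omega_xx_hat @ sigma_yx
```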
Theorem 3. Let $\hat\Sigma$ be an estimator of $\Sigma^*$, the covariance matrix of $(Y, X_1, \ldots, X_p)$, and let $\hat\Omega_{XX}$ be an estimator of $\Omega^*_{XX}$. Define the regression coefficient estimator as $\hat\beta = (\hat\Sigma_{YX} \hat\Omega_{XX})^{\mathrm{T}}$, where $\hat\Sigma_{YX}$ is the first column of $\hat\Sigma$. Then the error of $\hat\beta$ is controlled by the errors of $\hat\Sigma$ and $\hat\Omega_{XX}$ in the spectral norm. Under regularity conditions, a strong form of consistency of $\hat\beta$ as $n, p \to \infty$ is guaranteed by estimators of $\Sigma^*$ and $\Omega^*_{XX}$ that are consistent in the spectral norm, as constructed in § 2 assuming the existence of an approximate sparsity-inducing transformation.

Table 1 shows the performance in the spectral and Frobenius norms of four estimators of $\Sigma^*$: the sample covariance and its thresholded version, the pilot estimator $\hat\Sigma^+$, and the matrix exponential of the thresholded matrix logarithm of $\hat\Sigma^+$, namely $\tilde\Sigma = \exp(\tilde L)$ where $\tilde L = T_\nu(\hat L^{\mathrm{P}})$.

Table 1. Relative estimation errors (×100) in the spectral and Frobenius norms for covariance and concentration matrix estimation. Average estimation errors over 100 Monte Carlo replications for covariance matrix estimation (columns 3–6) and concentration matrix estimation (columns 8–10); rounding errors are less than standard errors divided by 10, and the largest standard error for each column is reported at the foot.
The pilot estimator $\hat\Sigma^+$ used here is the shrinkage-based pilot estimator with shrinkage parameter $\delta_{n,p} = \{(\log p)/n\}^{1/2}$. For thresholding in the matrix logarithmic domain, the tuning parameter is $\nu = \{(\log p)/n\}^{1/2}$; that is, a tuning constant $c = 1$ is chosen in both stages. For thresholding in the original domain, the tuning parameter is taken to be $\nu = c\{(\log p)/n\}^{1/2}$, with $c$ the value between 0.1 and 3 that minimizes the error. All errors are normalized by $\|\Sigma^*\|$, where the norm is that in which the errors are measured.
For the same $\Sigma^*$ in every experiment, we also consider estimation of $\Omega^* = (\Sigma^*)^{-1}$. The inverse sample covariance matrix does not exist for $p \geqslant n$, and that of the thresholding estimator may not exist either, so we compare $\tilde\Omega = \exp(-\tilde L)$ to the inverse of $\hat\Sigma^+$ and to $\hat\Omega^{\mathrm{C}}$, the adaptively constrained $\ell_1$ minimization estimator of Cai et al. (2016). Calculation of $\hat\Omega^{\mathrm{C}}$ involves $2p$ $p$-dimensional optimization problems, each with more than $4p$ constraints, which is computationally expensive and limits the extent of the testing. The tuning parameter for $\hat\Omega^{\mathrm{C}}$ is taken as $c\{(\log p)/n\}^{1/2}$. In the current experiments $c = 1$ and $c = 2$ are tested, the latter being the value suggested by Cai et al. (2016). The results for $c = 1$ are reported, as these are better than those with $c = 2$. The relative estimation errors in the spectral and Frobenius norms are shown in Table 1.
Computation of the matrix logarithm and exponential in high dimensions should not be attempted using the spectral representation. A more accurate method by Padé approximation is implemented in the logm and expm functions of Matlab.
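In Python, scipy.linalg.logm and expm serve as analogues of Matlab's logm and expm and likewise avoid the raw spectral representation; a round-trip residual gives a quick accuracy check, as in this illustrative snippet.

```python
import numpy as np
from scipy.linalg import expm, logm

rng = np.random.default_rng(1)
p = 200
a = rng.standard_normal((p, p)) / np.sqrt(p)
sigma = a @ a.T + 1e-3 * np.eye(p)           # ill-conditioned test covariance
log_sigma = np.real(logm(sigma))
roundtrip_error = np.linalg.norm(expm(log_sigma) - sigma, 2)
print(roundtrip_error)
```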

Misspecified sparsity structure
We illustrate here that there can be advantages in using a sparsity scale different from the original when no structure is present.
We assess through simulations the approximation and estimation errors from assuming that $\Sigma^*$ possesses sparsity structure when, in fact, it possesses none, showing empirically that wrongly assuming matrix logarithmic sparsity is roughly equivalent to wrongly assuming sparsity of $\Sigma^*$ in terms of approximation error but, for the situation considered here, is better in terms of estimation error when $p \geqslant n$. It is also better than making no sparsity assumptions. By approximation error, we mean the error incurred by replacing the true object with one in which sparsity is imposed on the appropriate scale, once transformed back to the original scale. By estimation error, we mean the error incurred by replacing $\Sigma^*$ with an estimator.

Table 2. Average (standard deviation in parentheses) over 100 Monte Carlo replications of: approximation errors (columns 3 and 4) from wrongly assuming sparsity in the original domain and in the logarithmic domain; estimation error of the sample covariance matrix (column 5), which correctly assumes no structure; and estimation errors from wrongly assuming sparsity in the original domain and in the logarithmic domain (columns 6 and 7).

The experiment is as follows. A matrix of orthonormal eigenvectors is constructed by generating a $p \times p$ matrix $Z$ whose entries are independent standard normal and computing the QR-factorization $Z = QR$. Then $\Xi$ is set equal to $Q$. This matrix is fixed, and in each of 100 Monte Carlo replications $p = 1000$ eigenvalues are generated at random from a gamma distribution with shape parameter $\tau$ and scale parameter $\kappa$, varied as in Table 2. A random unstructured covariance matrix is then constructed as $\Sigma^* = \Xi \Lambda \Xi^{\mathrm{T}}$, where $\Lambda$ is a diagonal matrix containing the gamma-distributed eigenvalues. Both $\Sigma^*$ and $L^* = \log(\Sigma^*)$ are thresholded so that only the entries above the 0.95 quantile in absolute value are kept. The resulting approximations, sparse in the original and logarithmic domains respectively, are denoted by $T(\Sigma^*)$ and $\exp\{T(L^*)\}$ in Table 2.
A random sample of size $n = p/10 = 100$ is drawn from a normal distribution with mean zero and covariance matrix $\Sigma^*$, and the sample covariance matrix $S_n$ is constructed along with $\hat\Sigma^+ = (1 - \delta_{n,p}) S_n + \delta_{n,p}\,\mathrm{diag}(S_n)$, where $\delta_{n,p} = \{(\log p)/n\}^{1/2}$. The estimation errors of $S_n$, its thresholded version and the matrix exponential of the thresholded version of $\hat L^{\mathrm{P}} = \log(\hat\Sigma^+)$ are reported in Table 2. Thresholding of both matrices is such that only entries above the 0.95 quantile in absolute value are kept. It is unclear why an erroneous assumption of logarithmic sparsity can lead to worse approximation error but better estimation error than an erroneous assumption of sparsity in the original domain. However, the logarithmic transformation lessens the contributions of the large eigenvalues to estimation of the sparsity structure: large eigenvalues are shrunk relatively more than small ones. This seems sensible, since the large eigenvalues of $\hat\Sigma^{\mathrm{P}}$, and hence of $\hat\Sigma^+$, are severely overdispersed. The lower eigenvalues have already been regularized by the conversion of $\hat\Sigma^{\mathrm{P}}$ to $\hat\Sigma^+$.
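For concreteness, a condensed single-replication sketch of this experiment in Python (assuming numpy and scipy; parameters as in the text, with 0.95-quantile thresholding applied in both domains; at $p = 1000$ the matrix logarithm is somewhat slow, so reduce $p$ for a quicker run):

```python
import numpy as np
from scipy.linalg import expm, logm

def keep_top(a, prop=0.05):
    # Retain only the proportion `prop` of entries largest in absolute value.
    nu = np.quantile(np.abs(a), 1.0 - prop)
    return np.where(np.abs(a) >= nu, a, 0.0)

rng = np.random.default_rng(2)
p, n = 1000, 100
q, _ = np.linalg.qr(rng.standard_normal((p, p)))            # orthonormal eigenvectors
lam = rng.gamma(shape=2.0, scale=2.0, size=p)               # gamma-distributed eigenvalues
sigma_star = (q * lam) @ q.T                                 # unstructured covariance matrix

x = rng.multivariate_normal(np.zeros(p), sigma_star, size=n)
s_n = x.T @ x / n                                            # sample covariance (zero mean)
delta = np.sqrt(np.log(p) / n)
sigma_plus = (1 - delta) * s_n + delta * np.diag(np.diag(s_n))

approx_log = expm(keep_top(np.real(logm(sigma_star))))       # approximation, sparse in log domain
estimate_log = expm(keep_top(np.real(logm(sigma_plus))))     # log-domain thresholding estimator
rel_err = np.linalg.norm(estimate_log - sigma_star, 2) / np.linalg.norm(sigma_star, 2)
```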
A further simulation study, showing the effect of assuming matrix logarithmic sparsity of * when it instead satisfies Condition 2 in the original domain, is in the Supplementary Material.

The Fisher procedure with misspecified covariance structure
While the goal of discriminant analysis is understanding of the data-generating process rather than prediction, performance in terms of the latter provides a way of assessing the proposed procedure.

Table 3. Out-of-sample misclassification rate (×100) of an adaptation of linear discriminant analysis and a high-dimensional classifier

We calculate the mean and standard deviation of the misclassification rate from out-of-sample classification. In each of 100 Monte Carlo replications, $N_+$ realizations of $p = 500$ normally distributed features with mean $\mu_+$ and covariance $\Sigma_+$ are obtained, together with $N_-$ realizations of $p$ normally distributed features with mean $\mu_-$ and covariance $\Sigma_-$. Call the corresponding $N_+ \times p$ and $N_- \times p$ matrices $x_+$ and $x_-$. The entries of the two mean vectors are standard uniformly distributed, and the two covariance matrices are generated as in the first example in § 4.2 with $(\tau, \kappa) = (2, 2)$. The two mean vectors and covariance matrices are drawn anew in each replication. As pointed out by Fisher (1936), discriminant analysis as described in § 3.1 for $p$ much smaller than the sample size is equivalent to using as the linear discriminant the coefficient vector from a linear regression in which the class labels are viewed as response variables and assigned the values $Y_+ = N_-/(N_+ + N_-)$ for class $+$ and $Y_- = -N_+/(N_+ + N_-)$ for class $-$, while the multivariate normally distributed data for each class are viewed as explanatory variables.
When $p$ is much larger than $N_+ + N_-$, which is the situation here, standard linear regression is not feasible, so we adapt it as outlined in § 3.2 using $\tilde\Omega_{XX} = \exp(-\tilde L)$, where $\tilde L$ is the 95% thresholded version of $\hat L^{\mathrm{P}}$, the latter constructed as in § 4.2 from the sample covariance matrix of $x$, the matrix whose first $N_+$ rows are $x_+$ and whose remaining rows are $x_-$. In each replication a separate test set is generated, consisting of 10 observations from the $+$ class and 10 observations from the $-$ class. The new individuals are classified into $+$ and $-$ based on the signs of their linear predictors, and the misclassification rates are recorded. Table 3 gives the results for different values of $N_+$ and $N_-$. Since the procedures of the present paper were not designed for classification, the suggested modification of the Fisher (1936) procedure is not expected to outperform other high-dimensional classifiers, but it seems competitive. Table 3 also reports the misclassification rates of another high-dimensional classifier (Fan et al., 2012), for which the tuning parameters are chosen by crossvalidation using the authors' code, available at https://uk.mathworks.com/matlabcentral/fileexchange/40047.
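A sketch of this regression form of the classifier in Python (illustrative only: the helper name, the centring step and the use of the sign of the linear predictor as the decision rule are spelled out here under the assumptions stated in the text):

```python
import numpy as np
from scipy.linalg import expm, logm

def fisher_regression_classifier(x_plus, x_minus, prop=0.05):
    n_plus, n_minus = x_plus.shape[0], x_minus.shape[0]
    x = np.vstack([x_plus, x_minus])
    # Class labels coded as in Fisher's regression formulation; y has mean zero.
    y = np.concatenate([np.full(n_plus, n_minus / (n_plus + n_minus)),
                        np.full(n_minus, -n_plus / (n_plus + n_minus))])
    n, p = x.shape
    x_bar = x.mean(axis=0)
    xc = x - x_bar
    s_xx = xc.T @ xc / n                                     # sample covariance of features
    delta = np.sqrt(np.log(p) / n)
    sigma_plus = (1 - delta) * s_xx + delta * np.diag(np.diag(s_xx))
    l_pilot = np.real(logm(sigma_plus))
    nu = np.quantile(np.abs(l_pilot), 1.0 - prop)            # keep largest 5% of entries
    l_tilde = np.where(np.abs(l_pilot) >= nu, l_pilot, 0.0)
    omega_tilde = expm(-l_tilde)                             # Omega_tilde_XX = exp(-L_tilde)
    sigma_yx = xc.T @ y / n                                  # estimated cov(Y, X)
    beta = omega_tilde @ sigma_yx                            # regression-form discriminant
    return lambda x_new: np.sign((x_new - x_bar) @ beta)     # classify by sign
```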

A real example
We apply the modification of the Fisher (1936) procedure described in § 4.3 to the leukaemia data used by Efron & Hastie (2016, § 19.1). The data consist of $p = 3571$ genetic features for 72 patients, of whom 47 have acute lymphoblastic leukaemia and 25 have acute myeloid leukaemia, a worse prognosis. We remove each patient from the dataset in turn, use the data on the remaining patients to estimate the linear discriminant as described in § 4.3 with the $\hat\Sigma^+$ of § 4.2, and predict the excluded patient's class label using his or her covariate information and the estimated coefficient vector. The misclassification rate is 8/72, much better than random guessing and possibly improvable with a data-dependent tuning parameter. The high-dimensional classification procedure of Fan et al. (2012), with tuning parameters chosen by crossvalidation, produced 6/72 misclassifications.