Localized inverse factorization

We propose a localized divide and conquer algorithm for inverse factorization $S^{-1} = ZZ^*$ of Hermitian positive definite matrices $S$ with localized structure, e.g. exponential decay with respect to some given distance function on the index set of $S$. The algorithm is a reformulation of recursive inverse factorization [J. Chem. Phys., 128 (2008), 104105] but makes use of localized operations only. At each level of recursion, the problem is cut into two subproblems and their solutions are combined using iterative refinement [Phys. Rev. B, 70 (2004), 193102] to give a solution to the original problem. The two subproblems can be solved in parallel without any communication and, using the localized formulation, the cost of combining their results is proportional to the cut size, defined by the binary partition of the index set. This means that for cut sizes increasing as $o(n)$ with system size $n$ the cost of combining the two subproblems is negligible compared to the overall cost for sufficiently large systems. We also present an alternative derivation of iterative refinement based on a sign matrix formulation, analyze the stability, and propose a parameterless stopping criterion. We present bounds for the initial factorization error and the number of iterations in terms of the condition number of $S$ when the starting guess is given by the solution of the two subproblems in the binary recursion. These bounds are used in theoretical results for the decay properties of the involved matrices. The localization properties of our algorithms are demonstrated for matrices corresponding to nearest neighbor overlap on one-, two-, and three-dimensional lattices as well as basis set overlap matrices generated using the Hartree-Fock and Kohn-Sham density functional theory electronic structure program Ergo [SoftwareX, 7 (2018), 107].


Introduction.
A standard approach to the solution of generalized eigenproblems (1) F c = λSc, where F and S are Hermitian and S is positive definite, makes use of an inverse factor of S to transform the eigenvalue problem to standard form [27,28].If we have that (2) F ⊥ x = λx with F ⊥ = Z * F Z and Zx = c.
Inverse factors may also be used in the solution of linear systems, either as part of a direct solver or as approximate inverse preconditioners for iterative solvers.We are here motivated by the solution of the self-consistent field equations appearing in a number of electronic structure models such as Hartree-Fock and Kohn-Sham density functional theory.In this context, F is the Fock/Kohn-Sham matrix and the density matrix is given by a subset of the eigenvectors of (1) as where nocc is the number of occupied electron orbitals and the eigenvalues are ordered in ascending order, The corresponding density matrix for the standard problem (3) is given by ( 5) which is related to (4) by D = ZD ⊥ Z * .In self-consistent field calculations, S is the basis set overlap matrix, in other contexts referred to as the mass or Gram matrix.Some of the most efficient methods to compute the density matrix assume the eigenproblem is on standard form [20,32,36,43,48]. Unless S = I (corresponding to an orthogonal basis) the computation of the density matrix then consists of three steps [42]: Density matrix method to compute D ⊥ from F ⊥ 3: Congruence transformation D = ZD ⊥ Z * A simple and efficient method for the second step sees the problem as a matrix function D ⊥ = θ(µI − F ⊥ ) where θ is the Heaviside function and µ ∈]λ nocc , λ nocc+1 [ and uses a polynomial expansion in F ⊥ to compute D ⊥ .The computational kernel is matrixmatrix multiplication for which implementations achieving good performance usually exist regardless of computational platform.For systems with nonvanishing eigenvalue gap at µ and local basis sets, the computational complexity with respect to system size can be reduced to linear, O(n), with control of the forward error [41].The foundation for linear scaling methods has been discussed extensively in the literature, see e.g.[1,9,21,22].Of particular importance are the decay properties of the density matrix [1] that are the basis of sparse approximations where only O(n) matrix entries are stored [25,29,38].Here we note that to take advantage of the decay properties in the three-step approach outlined above, an inverse factor Z that allows for a sparse approximation with O(n) entries must be used.
There is an infinite number of matrices Z that are inverse factors of S, i.e. that fulfill S −1 = ZZ * .The ones most commonly used are the inverse square root (Löwdin) [17,26,49] and inverse Cholesky factors [7,40], both of which have the important property of decay of matrix element magnitude with atomic separation [1].The cost of standard methods for their computation in general scales cubically with system size.A linear scaling alternative is based on iterative refinement [33] which, given a starting guess Z 0 , produces a sequence of matrices Z 1 , Z 2 , . . .with convergence to an inverse factor of S if the initial factorization error (6) I − Z * 0 SZ 0 2 < 1.
The iterative refinement converges to the inverse square root if Z 0 commutes with S and to some other inverse factor otherwise.The starting guess is often set to the identity matrix, scaled so that ( 6) is fulfilled [17,49].Being based on matrix-matrix multiplication the iterative refinement approach has advantages similar to those of the recursive density matrix expansions for the second step in the three-step approach described above.Recursive inverse factorization, considered in the present work, also makes use of iterative refinement but does so in a hierarchical manner [37].A binary partition of the index set {1, . . ., n} of S is applied corresponding to a binary principal submatrix decomposition (7) S = A B B * C , inverse factors Z A and Z C of A and C, respectively, are computed and ( 8) is used as starting guess for iterative refinement.This binary partitioning is applied recursively which gives the recursive inverse factorization method.The initial factorization error depends on the binary partition of the index set and one may thus attempt to partition the index set so that the initial error becomes as small as possible.It was shown in [37] that any binary partition or cut of the index set in two pieces leads to (6) being fulfilled.The index set may for example correspond to the vertices of a graph or centers of atom-centered basis functions.In this article, the terms vertex and index are used interchangeably.
In the present work, we further analyze the recursive inverse factorization method and propose a localized version.This variant exhibits more localized computations and is even more amenable to parallelization.Under certain assumptions, including S being localized with respect to some given distance function on its index set and provided that matrix entries with magnitude below some fixed threshold value are discarded, we show that using this localized inverse factorization the workload for the iterative refinement used to glue together Z A and Z C is proportional to the cut size.Thus, provided that a principal submatrix cut can be done so that only a small number of vertices, e.g.o(n), are close to the cut, the cost of the iterative refinement is negligible compared to the inverse factorizations of A and C for sufficiently large systems, e.g.o(n) versus O(n).In case S is the overlap matrix for a local atomcentered basis set, the distance function may, with our formulation of localization, be taken as the Euclidean distance between basis function centers.In this case it is usually straightforward to make a binary division such that only o(n) vertices are close to the cut.The two subproblems to compute inverse factors of A and C are completely disconnected and thus embarrassingly parallel.
In Section 2 we propose a sign matrix formulation for inverse factorization.This formulation is in Section 3 used in an alternative derivation of iterative refinement, making the relation to sign matrix and density matrix expansion methods evident.The stability of both regular and localized iterative refinement is considered and new stopping criteria are proposed.In Section 4 we introduce the binary principal submatrix decomposition and regular and localized iterative refinement algorithms including starting guesses given by the binary principal submatrix decomposition.We also derive convergence results, giving a bound for the number of iterative refinement iterations.In Section 5 we introduce the notion of exponential decay with respect to distance between vertices and exponential decay away from the cut.We derive localization results for the matrices occurring in regular and localized iterative refinement.In Section 6 we present the full recursive and localized inverse factorization algorithms.In Section 7 we present numerical experiments to demonstrate the localization properties of the recursive and localized inverse factorization algorithms.We end the article with concluding remarks in Section 8.

2.
Inverse factorization from a sign matrix formulation.We present here a sign matrix formulation that we will use to derive methods for the iterative refinement of inverse factors.
Theorem 1.Let S be a Hermitian positive definite matrix and assume that Q is a nonsingular matrix.Then, Since congruence transformation preserves positive definiteness, see e.g.[16, Theorem 4.5.8],Q * SQ is positive definite.The matrices QQ * S and Q * SQ have the same eigenvalues since is positive definite, X has no eigenvalues on the imaginary axis, and sign(X) is defined.Eq. ( 9) with The special case of Theorem 1 with Q = I leading to Z = S −1/2 was shown by Higham [12].This special case already shows that methods for the matrix sign function can be used to compute the square root together with its inverse.Theorem 1 can be used to reduce the computational cost of the matrix sign function evaluation if an approximate inverse factor is available or can be cheaply obtained.This approximate inverse could be such that the condition number of the problem is reduced and/or such that only a local portion of the inverse factor needs to be updated.In general, Z will not be the inverse square root.However, if Q is Hermitian and commutes with S, then Q and S are simultaneously diagonalizable, i.e.Q = V Λ Q V * and S = V Λ S V * with unitary V and diagonal Λ Q , Λ S , and Theorem 1 is closely related to and can be shown using [14,Lemma 4.3].We note that Z may be computed using some method for the inverse square root applied to Q * SQ followed by multiplication with Q from the left.We note also that the eigenvalues of X are real and given as positive-negative pairs λ 1 , −λ 1 , λ 2 , −λ 2 , . . . .

3.
Iterative refinement from sign matrix methods.Based on the sign matrix formulation above we provide here an alternative derivation of the iterative refinement method of [33].The second order refinement can be derived from the Newton-Schulz sign function iteration [13] (14) This gives the iteration The structure of X i , i = 0, 1, . . . is preserved and therefore only a single channel is needed for the iteration: Higher order polynomial iterations can be derived using the condition that the polynomial has fixed points and a number of vanishing derivatives at −1 and 1.These can be written as and m ≥ 1.The sign matrix residual is in each iteration reduced as (20) This leads to the iterative refinement is the factorization error in iteration i.Note that (17) is the special case of (22) with m = 1.We have that 2m+1 k=m+1 |c k | = 1 and therefore the iterative refinement converges if δ 0 2 < 1.This also means that (25) and that an accuracy δ k 2 < ε is reached within An alternative to (17) is given by applying the Newton-Schulz iteration with ( 27) which gives a coupled or dual channel iteration (28) This iteration has previously been considered with Q = I.Since I commutes with S, Y i and Z i then converges to the matrix square root and its inverse, respectively, as explained in the previous section.The corresponding higher order iterations can be obtained by inserting (27) in (18).This alternative is not further pursued in this work.We note that in the present context where the inverse factor does not have to be the inverse square root a drawback of (28) compared to (17) is that any accuracy lost during the iterations cannot be recovered.The polynomials in (18) can be seen as special cases of the Padé recursions for the matrix sign function [18].Other Padé recursions may also be used in the present context but are not further considered in this article.We note also that the Newton-Schulz iteration in ( 14) is equivalent to the McWeeny polynomial [30] that has been frequently used in computations of the density matrix.Similarly, the higher order polynomials in (18) correspond to the higher order Holas polynomials [15].Other alternatives not further pursued in this work includes the use of nonsymmetric polynomials in the expansion as in for example the SP2 algorithm for the density matrix [32] and the use of scaling techniques to reduce the number of iterations [19,36].
3.1.Local refinement.If only a local correction to the approximate inverse factor is needed the cost of the iterative refinement can be reduced.Assuming that δ i has o(n) non-negligible elements its computation using (24) will for large systems involve the computation of many matrix elements that are negligibly small.The factorization error can be written as was noted in [37], or as an update to the previous step factorization error which gives a dual channel iteration (31) In practice, the iteration is evaluated as where the soft brackets indicate the order of evaluation.Iteration (32) is a key component of the localized inverse factorization proposed in this work.Its localization properties will be discussed later in this article.
3.2.Stability.If Z 0 were assumed to be Hermitian and to commute with S and exact arithmetics were used the order of the matrices in the matrix products of ( 17) and ( 22) would be arbitrary.In practice numerical errors cause loss of commutativity which for some iterations results in instabilities leading to unbounded growth of errors.This effect has been studied in several papers for different variants of (17) and for other matrix iterations.For example, the iterations (33) considered in [17,34] and [6,35] respectively are both equivalent to (17) in exact arithmetics.However, both (33) and (34) are unstable unless S is extremely well conditioned [6,34].Iteration (17) has been shown to be locally stable around S −1/2 [17].
Here we will first consider the stability of ( 22) around any inverse factor Z and for any m.Then we will consider the stability of the local refinement given by (31).Given a matrix function f , the Fréchet derivative L f (X; E) at X in direction E is a mapping, linear in E, satisfying for all E [13].We follow [8,13,14] and define a matrix iteration X i+1 = f (X i ) to be stable in a neighborhood of a fixed point X = f (X) if L f (X; E) exists at X and there is a constant c such that max E =0 L p f (X;E) E ≤ c for any p > 0. Here, the pth power L p f (X; E) is used to denote p-fold composition of the Fréchet derivative in the second argument, e.g.
be the mapping associated with (22).We note that although Z * SZ = I implies Z = g m (Z), the converse is in general not true.For example, if Z * SZ = I, then both Z = g 2 (Z) and αZ = g 2 (αZ) with α = 7/3 are fixed points of g 2 , but (αZ) * S(αZ) = I.Let Z * SZ = I.Then the Fréchet derivative at Z, , and thus has bounded powers.Let (38) h be the mapping associated with (31).Let Z * SZ = I.Then the Fréchet derivative at .
We have that L hm (Z, 0; E, F ) is idempotent if F is Hermitian.Thus, ( 31) is stable if non-Hermitian perturbations of δ are disallowed, which can simply be achieved by storing only the upper triangle of δ.
The stability properties of the matrix iterations above are demonstrated in Figure 1.Again, we follow [13] and apply the iterations to the Wilson matrix For the local refinement without symmetric storage of δ, an unsymmetric perturbation in δ 0 causes a small drift away from the initial inverse factor.However, the factorization error stays small.

Stopping criterion.
In [24] we proposed a new type of parameterless stopping criteria for iterative methods.These stopping criteria were originally used in recursive polynomial expansions to compute the density matrix.The stopping criteria are based on the detection of a discrepancy between theoretical and observed orders of convergence, which takes place when numerical errors start to dominate the calculation.In other words, given some error measure or residual, the theoretical worst case reduction of the error is derived.If the error decreases any slower than this theoretical worst case reduction the calculation has reached the stagnation phase, and it is time to stop the iterations.
For the iterative refinement ( 22) a worst case error reduction is given by (25).We may therefore formulate our stopping criterion for the iterative refinement as stop as soon as δ i+1 2 > δ i m+1 2 .
3.3.1.Frobenius norm.The spectral norm can be expensive to compute in practical calculations.In particular, near the iterative refinement convergence the eigenvalues of δ i may be clustered near 0, making it difficult for an iterative eigensolver to compute the spectral norm.One may therefore want to use a cheaper alternative in the stopping criterion.The computational cost of the Frobenius norm is independent of the eigenvalue distribution and requires just one pass over the matrix entries.The Frobenius norm is equal to where {λ j } n j=1 are the eigenvalues of δ i ordered so that Following the discussion in [24] we have that if then in exact arithmetics, using (25), we have that ( 44) , and the stopping criterion for the iterative refinement can be formulated as stop as soon as δ i+1 F > δ i m+1 F . Now we will show that the condition (43) is fulfilled.Let Then, β j ≤ β n , j = 1, 2, . . ., n − 1 and we have Moreover, the spectral norm can be written as Then, we obtain 4. Binary principal submatrix decomposition.The basic component of our recursive and localized inverse factorization algorithms is a binary principal submatrix decomposition of S. Let S be partitioned as and let Z * A AZ A = I and Z * C CZ C = I be inverse factorizations of A and C, respectively.Then, an inverse factor of S can be computed using iterative refinement with ( 49) as starting guess, as described by Algorithm 1.To be able to strictly bound the number of iterations in exact arithmetics, we let Algorithm 1 not make use of the parameterless stopping criterion which relies on numerical errors to decide when to stop.
In practical calculations, the parameterless stopping criterion is preferable and will be used in the formulation of the full recursive and localized algorithms in Section 6.
Algorithm 1 Iterative refinement end while 10: return Z i 11: end procedure Algorithm 1 as is already features localized computations.The foregoing computation of Z A and Z C can be performed as two separate computations without any interaction or communication between them.The following iterative refinement employs matrix-matrix multiplications for which implementations with good performance usually exist, both for serial and parallel execution.Furthermore, the matrices involved in the algorithm are typically sparse with localized nonzero structure.If the sparse matrix-matrix multiplications are performed using the locality-aware parallel block-sparse matrix-matrix multiplication of [39], communication can be further reduced.
We will now introduce modifications to Algorithm 1 to further improve its localization features and avoid both computation and communication that is unnecessary.Our localized refinement is given by two modifications of Algorithm 1. Firstly, we make the observation that and use (51) for the computation of δ 0 .Secondly, we use the local version of iterative refinement given by (32).Our localized iterative refinement is given by Algorithm 2.
Although Algorithms 1 and 2 are mathematically equivalent, their cost of execution and numerical behavior is different.In the localized refinement the factorization errors I − Z * A AZ A and I − Z * C CZ C are taken as zero and the factorization error δ i+1 Algorithm 2 Localized refinement 1: procedure local-refine(S, Z 0 , ε) while δ i 2 > ε do 6: 7: end while return Z i 12: end procedure is in each iteration computed by updating the error δ i from the previous iteration.This means that the algorithm is not capable of correcting for any initial errors in Z A and Z C nor for any errors introduced while updating δ i+1 .This stands in contrast to Algorithm 1 where the factorization error is recomputed in each iteration.Both algorithms are stable though, as shown in Section 3.2.Another drawback with Algorithm 2 is that it requires more matrix-matrix multiplications per iteration.With m = 1, Algorithm 1 and Algorithm 2 make use of 3 and 4 multiplications per iteration, respectively, assuming that the equality (M * i S) = (SM i ) * is used to avoid 1 multiplication in Algorithm 2. Nevertheless, a great advantage of Algorithm 2 is its localization properties.Although both algorithms feature localized computations in some sense, we will see in Section 5 that the localized refinement is superior for large systems with localization in S.
It was shown in [37] that with the starting guess given by (49), we always get an initial factorization error δ 0 2 < 1 and convergence of the iterative refinement, regardless of what inverse factors Z A and Z C that are used.The following theorem is a strengthening of this result giving quantitative insight into how the convergence depends on the eigenvalues or condition number of S.
Theorem 2. Let S be a Hermitian positive definite n × n matrix partitioned as Then, Theorem 2 implies δ 0 2 < 1 and convergence of the iterative refinement in Algorithms 1 and 2. It follows from ( 26) and (56) that for a given level of accuracy δ i 2 < ε the number of iterations is bounded by Proof.Lemma 3 is a specialization of a theorem due to Ostrowski [50, Satz 1].
Proof of Theorem 2. The inequalities (54) and (55) follow directly from Cauchy's interlacing theorem, see e.g.[16].Recall that SZ 0 Z * 0 has the same eigenvalues as Z * 0 SZ 0 and invoke Lemma 3 with A = S and B = Z 0 Z * 0 .This gives where we used that Z 0 Z * 0 = A −1 0 0 C −1 and again Cauchy's interlacing theorem.This gives us the upper bound in (56).The lower bound follows from the fact that I − Z * 0 SZ 0 is a so-called Jordan-Wielandt matrix with positive-negative eigenvalue pairs, see e.g.[47].
Before we present our complete localized inverse factorization algorithm we will theoretically investigate localization properties of Algorithms 1 and 2. In particular, we will under certain assumptions show that the localized refinement involves only operations on matrices with a number of significant entries proportional to the size of the cut that defines the principal submatrix decomposition.d(i, j) = d(j, i), d(i, i) = 0, and  d(i, j) ≤ d(i, k) + d(k, j) hold for all i, j, and k.Let N d (i, R) = {j : d(i, j) < R} be the set of vertices within distance R from i and let |N d (i, R)| denote its cardinality.In case S is a basis set overlap matrix for a basis set with atom centered basis functions, the vertices (indices) correspond to basis function centers and d(i, j) may be naturally taken as the Euclidean distance between the centers corresponding to i and j.
5.1.Exponential decay with distance between vertices.We will say that an n × n matrix S, with associated distance function d(i, j), has the property of exponential decay with respect to distance between vertices with constants c and α if for all i, j = 1, . . ., n with c > 0 and α > 0. We shall in particular be concerned with sequences of matrices {S n } with associated distance functions {d n (i, j)} that satisfy exponential decay with respect to distance between vertices (63) with constants c and α independent of n.
Theorem 4. Let {S n } be a sequence of n × n matrices with associated distance functions {d n (i, j)} and assume that each S n satisfies the exponential decay property for all i, j = 1, . . ., n with constants c > 0 and α > 0 independent of n.Assume that there are constants γ > 0 and β > 0 independent of n such that holds for all i, for any R ≥ 0.Then, for any given ε > 0, each S n contains at most O(n) entries greater than ε in magnitude.Also, the number of entries greater than ε in magnitude in each row and column of each S n is bounded by a constant independent of n.
Proof.For any matrix entry [S n ] ij with magnitude greater than ε, (64) implies (66) which is a constant independent of n.For each vertex i the number of vertices within a constant distance is, due to (65), bounded by a constant independent of n.For any row or column in the matrix the number of entries larger than ε is therefore bounded by a constant and the total number of entries satisfying Theorem 5. Let {A n } and {B n } be sequences of n × n matrices with a common associated distance function d n (i, j) for each n.Assume that for all i, j where c 1 , c 2 , and α are positive and independent of n.Assume that there are constants γ > 0 and β > 0 independent of n such that holds for all i, for any R ≥ 0.Then, the entries of for any α such that 0 < α < α with c independent of n.
Proof.Let ω = α − α and note that α < α.Then, e −α (dn(i,k)+dn(k,j)) e −ωdn(k,j) (74) e −ωdn(k,j) e −α dn(i,j) .(75) So far we have essentially followed the proof of Theorem 9.2 in [1].It remains to show that the sum n k=1 e −ωdn(k,j) is bounded by a constant independent of n.Grouping the summands with respect to distance from vertex j gives | is the number of vertices located at a distance r − 1 ≤ d n (k, j) < r from vertex j.The sum (78) is independent of n and can be shown to converge using the ratio test [46, p. 66].
As already noted, the results of this subsection are closely related to results previously presented for example in [1].See in particular Proposition 6.4 and Theorem 9.2 of [1].A key difference, however, is that in [1] the distance function or metric on the index set I S is assumed to be the geodesic distance function of a graph defined over I S .In this sense the present formulation, where any pseudometric may be used, is more general.However, in [1] a less restrictive condition for the number of neighbors of any node is used, in comparison with (65) and (69).In [1] it is assumed that the maximum degree of the graph, i.e. the largest number of immediate neighbors of any vertex, is uniformly bounded with respect to n.To see that this is a less restrictive condition, consider for example the graph given by a binary tree with maximum degree 3.For any constants γ and β there exist n, R, and i such that |N dn (i, R)| > γR β since max i |N dn (i, R)| grows exponentially with R for large enough n, so that (65) and (69) are violated.However, in calculations with the pseudometric taken as Euclidean distance between atom centered basis functions this is not an issue since the number of basis function centers within a certain distance R will never exceed O(R 3 ).

Exponential decay away from cut.
We will here consider matrices with the property of exponential decay away from a set of indices I ⊆ {1, . . ., n}.The decay may be with respect to row index Note that for i ∈ I A the first term on the right hand side will be zero and the distance to the cut is thus defined as the distance to the closest vertex in I C , and vice versa for i ∈ I C .This is illustrated in Figure 2. We will say that S has the property of exponential decay away from the cut with constants c and α if for all i, j = 1, . . ., n with c > 0 and α > 0. We note that this is equivalent to the elements of S satisfying the four conditions for all i, j = 1, . . ., n.We shall in particular be concerned with sequences of matrices {S n } where each matrix S n is associated with a distance function d n (i, j), a binary partition of its index set, i.e.I An ⊂ {1, . . ., n} and I Cn = {1, . . ., n} \ I An , and the distance to the cut, defined as in (81), (84) Theorem 6.Let {S n } be a sequence of n × n matrices satisfying the assumptions of Theorem 4. Associate with each S n a binary partition of its index set and let d cut n (i) be defined as in (84).Assume furthermore that for each distance R, there are constants γ > 0 and β > 0 independent of n and a function p(n) such that (85) |{i : d cut n (i) < R}| < γR β p(n), i.e. the number of vertices within distance R from the cut is bounded by γR β p(n).Assume also that at least one of and hold for all i, j = 1, . . ., n.Then, for any ε > 0 each S n contains at most O(p(n)) entries greater than ε in magnitude.
Proof.For any matrix entry [S n ] ij with magnitude greater than ε, (86) implies (87) By (85) the number of vertices within distance 1 α ln c ε from the cut is bounded by γ 1 α ln c ε β p(n), where γ 1 α ln c ε β is a constant independent of n.Thus, only O(p(n)) rows or columns may have entries with magnitude greater than ε and by Theorem 4, the number of entries in each row and column with magnitude greater than ε is bounded by a constant.
We will refer to the number of vertices within a given distance R from the cut, i.e. |{i : d cut n (i) < R}|, as the cut size.The function p(n) in (85) describes how the cut size increases with n.Theorem 6 tells us that for a sequence of matrices with exponential decay with distance between vertices and exponential decay away from the cut, the number of significant entries does not grow faster than the cut size.Note that, in general, exponential decay away from the cut alone is not sufficient to reach this result.For example, assume that p(n) = √ n, so that the cut size grows as O( √ n), and that both conditions in (86) are satisfied.Then, O( √ n) rows and O( √ n) columns may have entries with magnitude greater than ε, giving a total of O(n) matrix entries that may have magnitude greater than ε.Theorem 7. Let {S n } and {T n } be sequences of n × n matrices with a common associated distance function d n (i, j) for each n.Assume that for any R ≥ 0, there are constants γ > 0 and β > 0 independent of n such that holds for all i.For each n, let I n be a subset of the index set {1, . . ., n} and let c 1 , c 2 , and α be positive and independent of n.
(i) Assume that for all i, j.Then, the entries of for any α such that 0 < α < α with c independent of n.This bound holds also with U n = T n S n .
(ii) Assume that for all i, j.Then, the entries of U n = S n T n satisfy (94) |[U n ] ij | ≤ ce −α min p∈In dn(j,p) for all i, j for any α such that 0 < α < α with c independent of n.This bound holds also with U n = T n S n .
Proof.Case (i): Similarly to the proof of Theorem 5 we have, again with ω = α − α , that e −α (min p∈In dn(i,p)+dn(k,j)) e −ωdn(k,j) (96) e −ωdn(k,j) e −α min p∈In dn(i,p) (97) where n k=1 e −ωdn(k,j) is a constant independent of n, due to (88), as shown in the proof of Theorem 5.The case with U n = T n S n and case (ii) can be shown in essentially the same way using that min p∈In d n (k, p) + d n (k, j) ≥ min p∈In d n (j, p).
We are particularly interested in the case of a binary partition of the index set and exponential decay away from the cut.The following result shows that multiplication of a matrix with exponential decay away from the cut with a matrix with exponential decay with distance between vertices gives a matrix with exponential decay away from the cut.
Corollary 8. Let {S n } and {T n } be sequences of n×n matrices with a common associated distance function d n (i, j) for each n.Assume that for any R ≥ 0, there are constants γ > 0 and β > 0 independent of n such that holds for all i.For each n, let I An , I Cn be a binary partition of the index set {1, . . ., n} and assume that for all i, j where d cut n (i) = min k∈I An d n (i, k) + min k∈I Cn d n (i, k) and where c 1 , c 2 , and α are positive and independent of n.Then, the entries of for any α such that 0 < α < α with c independent of n.This bound holds also with Proof.Using that exponential decay away from the cut is equivalent to the four conditions (83) the result follows directly from Theorem 7.

Localization in iterative refinement.
In this section we will derive localization results for the matrices occurring in regular and localized iterative refinement.We will under certain assumptions show that while both Algorithms 1 and 2 involve only matrices that satisfy exponential decay with respect to distance between vertices, all matrices constructed in Algorithm 2 also satisfy exponential decay with respect to distance from the cut.Theorem 9. Let {S n } and {Q n } be sequences of n × n Hermitian matrices with associated distance functions {d n (i, j)}.Let each S n be partitioned as and let where Z An and Z Cn satisfy Z * An A n Z An = I and Z * Cn C n Z Cn = I, respectively.Let {Z n } be the sequence of matrices produced by Algorithm 1 or Algorithm 2 with S n , Q n , and a constant ε as input.For each iteration i = 0, 1, . . ., until the stopping criterion δ i 2 ≤ ε is triggered, let be the sequences of matrices corresponding to each of the intermediate matrices occurring in either one or both of Algorithm 1 and Algorithm 2. Assume that for any R ≥ 0, there are constants γ > 0 and β > 0 independent of n such that holds for all i.Assume that S n and Q n satisfy the exponential decay properties for all i, j = 1, . . ., n with constants c > 0 and α > 0 independent of n.Assume also that the condition number of S n , κ n = |λ max (S n )/λ min (S n )|, is uniformly bounded with respect to n.
Then, each of the matrices in (104) through (109) satisfies an exponential decay property on the form for any α such that 0 < α < α with c independent of n, where X n is any of the matrices in (104) through (109).Besides satisfying (113), the matrices in (105) through (109) also satisfy exponential decay away from the cut on the form for any α such that 0 < α < α with c independent of n, where X n is any of the matrices in (105) through (109).
Lemma 10.Let S = A B B * C satisfy the exponential decay property with respect to distance between vertices for all i, j = 1, . . ., n with positive constants c and α and assume that A = 0 and C = 0. Then S also satisfies the exponential decay property with respect to distance to cut for all i, j = 1, . . ., n.
Proof.For all i ∈ I A , j ∈ I C we have that ( 117) and therefore The same bounds clearly hold also for i ∈ I C , j ∈ I A and since all other entries are zero, the exponential decay property with respect to distance to cut is thus satisfied.
Proof of Theorem 9. Since κ n is uniformly bounded, by (57) the number of iterations until the stopping criterion is triggered is also uniformly bounded.All matrices in (104) through (109) are therefore the result of a bounded number of matrix-matrix multiplications and additions.Repeated application of Theorem 5 implies the exponential decay property (113) for each matrix in ( 104 also satisfies an exponential decay property with respect to distance to the cut.Corollary 8 also applies to the matrix δ 0 since it is the result of multiplication of Q * n with (119).In fact, Corollary 8 may be applied to each matrix in (105) through (109) since a matrix with exponential decay away from the cut is involved in each product.Theorem 4 and Theorem 9 imply that for any ε > 0 each matrix in (104) through (109) contains at most O(n) entries greater than ε in magnitude.Furthermore, the number of entries greater than ε in magnitude in each column or row is bounded by a constant independent of n.Let p(n) be a function such that the cut size increases as O(p(n)) with system size n.Then, Theorem 6 and Theorem 9 imply that for any ε > 0 each matrix in (105) through (109) contains at most O(p(n)) entries greater than ε in magnitude.
6. Recursive and localized inverse factorization.Associate with S a binary principal submatrix tree corresponding to a recursive binary partition of the index set {1, . . ., n} of S. This recursive partitioning may continue down to single matrix elements or stop at some higher level.
Given such a binary principal submatrix tree, one can use Algorithm 1 in a recursive construction of an inverse factor.The matrix is split into four quadrants according to the binary partition of the index set, the method is called recursively on the two diagonal submatrices, and then the iterative refinement of Algorithm 1 is used to obtain the inverse factor of the whole matrix.Our recursive inverse factorization algorithm is given by Algorithm 3. Algorithm 3 was first proposed in [37], but includes here the new stopping criterion for iterative refinement presented in Section 3.3.
Our localized inverse factorization, given by Algorithm 4, is obtained in essentially the same way, but makes use of the localized construction of starting guess and the localized iterative refinement of Algorithm 2.
Theorem 2 implies convergence of the iterative refinement for each level of the recursion in Algorithm 3. Furthermore, it follows from (54) and ( 55) that the bound of the number of iterations, needed to reach an accuracy δ i 2 < ε, given by ( 57) holds for all levels in the recursion.

2:
input: Hermitian positive definite matrix S with an associated binary principal submatrix tree.
return Z i+1 15: end procedure 7. Numerical experiments.In this section the localization properties of the localized inverse factorization, i.e.Algorithm 4, are demonstrated.In all experiments, the recursion in Algorithm 4 is continued all the way down to single matrix elements where Z = 1/ √ S. In this way the final inverse factor is, up to differences due to floating point rounding, uniquely determined by Algorithm 4 and the recursive partition of the index set.In the iterative refinement, m = 1 in all numerical experiments.
Note that from a computational point of view it would likely be beneficial to stop the recursion at some predetermined larger block size and use for example the AINV algorithm [4] or one of its variants [2,3,40] to compute the inverse factor at the lowest level.The recursion may for example be stopped when there is no input: Hermitian positive definite matrix S with an associated binary principal submatrix tree.

3:
if lowest level then With the given binary principal submatrix partition S = A B B * C ,

11:
12: return Z i+1 16: end procedure longer enough sparsity to take advantage of localization in S and/or when the inverse factorization at that level will run serially, e.g.due to limited resources, so that the parallel features of the recursive algorithm will anyway not be utilized.
7.1.Simple lattices.Our first set of benchmark systems are chosen to clearly demonstrate the localization behavior for one-, two, and three-dimensional systems.The systems are also such that the results should be relatively easy to reproduce, not relying on auxiliary information, requiring extensive programming effort nor access to a supercomputer.We consider adjacency matrices corresponding to one-, two-, and three-dimensional integer lattices, i.e. a grid with unit spacing between nearest neighbors.Diagonal matrix entries are set to α and matrix entries corresponding to edges between nearest neighbors on the lattice are set to β.In the one-dimensional case this gives a tridiagonal matrix.In the two-and three-dimensional cases the vertices of the lattice are ordered using a recursive binary divide space procedure.At each level of the recursion the vertices are sorted along the greatest dimension and split in two subsets.Unless otherwise stated, we use the set of parameters in Table 1.In the image of Z in the upper left panel for the one-dimensional system the matrix element magnitude clearly decays away from the diagonal which in this case corresponds to decay with distance between vertices.This decay with distance is more difficult to grasp looking only at the images in the upper panels for the two-and three-dimensional systems.However, the lower panels reveals an even faster decay Fig. 3: Localization in Z produced by Algorithm 4 for the lattice 1D, 2D, and 3D benchmark systems, with parameters and dimensions given in Table 1.The upper panels show the matrix Z as an image where the color indicates the magnitude of the matrix elements.The lower panels show the decay of matrix elements with distance between vertices on the lattice, corresponding to each of the upper panels.
with distance for the two-and three-dimensional systems, which is due to the smaller matrix entries corresponding to edges between nearest neighbors determined by the parameter β, see Table 1.
7.1.2.Localization in correction matrices.The complete recursion in Algorithm 4 can be seen as a sequence of corrections to the inverse factor, i.e.Z = r−1 l=0 K l where the subscript denotes the level of recursion with K 0 being the correction at the root level of the recursion and r is the number of recursion levels.The correction matrix K l is a block diagonal matrix, where each block corresponds to a node of the binary tree of recursive calls at level l and is, for l < r − 1, equal to the sum of the matrices M i in the iterative refinement corresponding to this node.The matrix K r−1 is, in our case where we continue the recursion down to single matrix elements, the diagonal matrix given by K r−1 = diag(S) −1/2 .Note that for l > 0, the correction matrix K l spans all branches in the binary tree of recursion at level l.We will use the correction matrices as representatives for the matrices in Algorithm 4 corresponding to the matrices in (105) through (109), which all show similar localization behavior.
Figure 4 shows images of the correction matrices K l for l = 0, 1, 2, 3, for the one-, two-, and three-dimensional lattice benchmark systems.It may be noted that these figures correspond directly to the images of Z in Figure 3.One may for example note that the upper right and lower left quadrants of K 0 are identical to the corresponding submatrices of Z.This is expected as correction to those submatrices only occurs at the root level of the recursion.Again the localization, with rapid decay away from the cut (or cuts for l = 1, 2, 3), is clearly seen in the one-dimensional case, but not as easily seen in the two-and three-dimensional cases.
To more clearly see the localization behavior we plot in Figure 5 the decay with distance between vertices and the decay away from the cut for the root level correction matrices K 0 .Clearly, there is rapid decay both with distance between vertices and away from the cut, which may be compared to the decay in the matrices Z seen in the lower panels in Figure 3.
The key advantage of the localized inverse factorization of Algorithm 4 over the regular recursive inverse factorization of Algorithm 3 is its superior localization properties.Algorithm 3 includes products of matrices that feature localization of the type seen in Figure 3, i.e. exponential decay with distance between vertices.See for example the construction of δ i on line 12 of Algorithm 3 with matrix products involving Z i and S. Algorithm 4, on the other hand, involves only operations, matrix products and additions, where at least one of the matrices feature localization of the type seen in Figures 4 and 5.This means that, if only matrix elements of significant magnitude are included in the calculation, the computational cost of Algorithm 4 may be way lower than that of Algorithm 3.
7.1.3.Localization with increasing system size.So far all calculations have been for the fixed system sizes given in Table 1.We are now interested in how the localization changes with increasing system size.We consider again adjacency matrices corresponding to integer lattices but with increasing dimension.We set the parameters α and β as in Table 1.We consider one-dimensional grids with 8, 16, 32, 64, 128, 256, and 512 vertices, two-dimensional grids with s × s vertices where s = 4, 8, 16, 32, 64, and three-dimensional grids with s × s × s vertices where s = 2, 4, 8, 16.
To see how the localization changes with system size we plot in Figure 6 the number of matrix entries in Z and K 0 whose absolute value is above a given threshold value with increasing system size.For the final inverse factor Z the number of significant matrix entries increases linearly with system size which is consistent with an exponential decay with respect to distance between vertices |Z ij | ≤ ce −αdn(i,j) , i, j = 1, . . ., n with c and α independent of n, see Theorem 4.
For the correction matrices K 0 Figure 6 displays very different behavior for the one-, two-, and three-dimensional systems.This can be understood by considering how the cut size varies with system size.In the 1D case, the number of vertices within any fixed distance R from the cut, i.e. the cut size, is bounded by a constant independent of n.In the 2D case, the cut size increases as O( √ n) and in the 3D case, the cut size increases as O(n 2/3 ).We recall from Theorem 6 that a matrix that satisfies both exponential decay with distance between vertices and away from a cut with cut size p(n) contains at most O(p(n)) entries with magnitude greater than any given constant ε.The nearly perfect least squares fits in the lower panels of Figure 6 indicate that the number of entries in K 0 increases as O(1), O( √ n), and O(n 2/3 ) for the 1D, 2D, and 3D cases respectively.The results in the lower panels Fig. 4: Correction matrices K l , l = 0, 1, 2, 3 as images, extracted from Algorithm 4 for the calculations that produced the Z-matrices in Figure 3. Fig. 5: Localization in the root level correction matrix K 0 for the calculations that produced the Z-matrices in Figure 3.The upper panels show the decay of matrix elements with distance between vertices on the lattice.The lower panels show the decay of matrix elements with distance away from the cut. of Figure 6 are therefore consistent with K 0 satisfying both exponential decay with respect to distance between vertices |(K 0 ) ij | ≤ ce −αdn(i,j) and away from the cut with c and α independent of n.

Alkane chains and water clusters.
We consider here application of the localized inverse factorization to basis set overlap matrices occurring in Hartree-Fock and Kohn-Sham density functional theory electronic structure calculations using standard Gaussian basis sets.The overlap matrices were generated using the Ergo opensource program for linear-scaling electronic structure calculations [45], publicly available at ergoscf.org under the GNU Public License (GPL) v3.In the case of such overlap matrices, each vertex corresponds to a basis function center and we let the distance function d(i, j) be the Euclidean distance between basis function centers i and j.The magnitude of matrix entries decays as |S ij | ≤ ce −αd(i,j) 2 which is even faster than exponential decay [38].In all calculations the standard Gaussian basis set STO-3G was used.
We consider overlap matrices for alkane chains and water clusters.We run calculations with two different sizes for each type of system to be able to see if the exponential decay properties are retained when the system size is increased.The alkane chain xyz coordinates were generated using the generate_alkane.ccprogram included in the Ergo source code package.The water cluster xyz coordinates, originally used in [44], are available for download at ergoscf.org.
Localization results for the alkane chains are plotted in Figures 7 and 8. Figure 7 shows the magnitude of the matrix entries in S −1 and Z as functions of distance between vertices or basis function centers along with an image of Z indicating the magnitude of the matrix entries as in previous figures.Figure 8 shows the magnitude of the matrix entries in K 0 as a function of distance between vertices and as a function of distance away from the cut along with an image of K 0 .The matrix dimensions in Fig. 6: Number of non-negligible matrix entries in the inverse factor Z and the root level correction matrix K 0 as a function of system size, for Algorithm 4 and systems based on one-, two-, and three-dimensional lattices.The upper panels show the number of entries in Z above 10 −6 and 10 −8 .The dashed help lines in the upper panels show linear least squares fits to the data.The lower panels show the number of entries in K 0 above 10 −6 and 10 −8 .The dashed help lines in the middle and right lower panels show c 0 + c 1 √ n and c 0 + c 1 n 1/3 + c 2 n 2/3 least squares fits, respectively.
the 386 atom and 1538 atom cases are 898 and 3586, respectively.These numbers, in contrast to the 1D lattice system considered above, are not powers of 2, which explains why significant matrix entries of K 0 in Figure 8 corresponding to atom centers close to the cut are not located at the center of the matrix.The corresponding localization results for the water clusters are plotted in Figures 9 and 10.The dashed help lines makes it easy to see that the exponential decay properties are retained for all matrices S −1 , Z, and K 0 and for both types of systems.
8. Concluding remarks.Previous work on the computation of inverse factors has to large extent focused on approximations used as preconditioners for iterative solution of linear systems [5].Examples, besides the AINV algorithms already mentioned, include FSAI [23] and more recent variants [11].Methods making use of a recursive partitioning of the matrix is used in direct methods for factorization such as multifrontal methods [10].In this case the matrix is seen as the adjacency matrix of a graph and the matrix is partitioned using a three-by-three block partition corresponding to a vertex separator of the graph.Multifrontal methods have typically been used for e.g. the Cholesky decomposition and not its inverse factor.However, the multifrontal approach could also be combined with e.g. the AINV algorithm for direct computation of the inverse Cholesky factor.A starting guess for iterative refinement is in [31] built up from the inverse factors of overlapping principal submatrices.The convergence has not been proven but this approach could possibly lead to a lower initial factorization error and reduced number of iterative refinement iterations com- pared to our binary partition at the expense of more computations to construct the starting guess.Note though that for large enough systems with localization we have, under certain assumptions, shown that the cost of our localized inverse factorization is completely dominated by the solutions of the subproblems and not the iterative refinement used to glue together their solutions.
A strength of the localized inverse factorization algorithms is the proved convergence for any binary partition of the index set.Inappropriate partitions at worst lead to poor localization and higher computational cost.We used in our numerical experiments a straightforward space-dividing algorithm for the recursive binary partition of the index set.More advanced partitioning algorithms could take into account the magnitude of the entries in S and attempt to minimize the cut size and the initial factorization error.
Our localized inverse factorization algorithm could be combined with some direct inverse factorization method that is efficient for small dense or semi-sparse systems.In the numerical experiments we continued the recursion all the way down to single matrix elements.It would in general be more efficient to stop the recursion as soon as the matrix size is smaller than some predetermined block size and use for example one of the AINV algorithms to compute the inverse factor.Another possibility is to use regular recursive inverse factorization for intermediate levels in the recursion.
We showed in Section 5 that under certain assumptions both the regular and localized iterative refinement of Algorithms 1 and 2 involve only matrices with exponential decay with distance between vertices.However, all matrices involved in Algorithm 2 also have the property of exponential decay away from the cut.This means that whereas the number of significant matrix entries in Algorithm 1 grows not faster than O(n), the number of significant matrix entries in Algorithm 2 also does not grow faster than the cut size.For binary partitions with a cut size increasing as o(n) and large enough systems the localized algorithm is therefore superior to the regular version.Note that we have not provided a rigorous proof for the localization features of the full recursive and localized inverse factorization algorithms demonstrated in Section 7.There are two important differences compared to Algorithms 1 and 2. Firstly, it is difficult to come up with a strict bound for the number of iterations since the stopping criterion relies on stagnation due to numerical errors.Secondly, the computed inverse factor is the result of a sum over r = O(log(n)) correction matrices, where r is the number of levels in the recursion.Clearly, r tends to infinity together with n so that the number of correction matrices in the sum tends to infinity with increasing n, which entails another difficulty.Nevertheless, all our numerical experiments indicate that the localization properties of Algorithms 1 and 2 are inherited by their recursive counterparts.Furthermore, our results indicate that the total computational cost can increase linearly with system size if negligible matrix entries are not stored nor included in the calculation.In summary, the theoretical results of Section 5 together with the numerical experiments of Section 7 make us confident that the localized inverse factorization, especially with regard to localization, represents a dramatic improvement over the regular recursive inverse factorization.
The localized inverse factorization is also well suited for parallelization.The two subproblems of computing inverse factors of the two principal submatrices at each level of the recursion are completely disconnected and can thus be solved in

5 .
Localization.Let d(i, j) be a pseudometric on the index set I S = {1 . . .n} of S, i.e. a distance function such that d(i, j) ≥ 0,

Fig. 2 :
Fig. 2: Illustration of distance to cut.The vertex (index) set {1, . . ., n} is partitioned in two subsets I A (crosses) and I C (circles).For a vertex a in I A , the distance to the cut d cut (a) is defined as the distance to the closest vertex in I C .For a vertex b in I C , the distance to the cut d cut (b) is defined as the distance to the closest vertex in I A .
) through (109).By Lemma 10, the matrix 0 B n B * n 0 satisfies an exponential decay property with respect to distance to the cut.Therefore and by Corollary 8 the matrix

Figure 3
demonstrates the localization in the final inverse factor Z produced by Algorithm 4 for the 1D, 2D, and 3D lattice systems.Note that, in the upper panels showing images of the Z-matrices, a single blue color is used to indicate matrix elements with absolute value smaller than 10 −30 .Although not visible in the figures, the decay of course continues below 10 −30 .

Fig. 7 :
Fig. 7: Left and center panels: Magnitude of matrix entries as a function of distance between vertices (basis function centers) for S −1 and Z computed using the localized inverse factorization.Right panel: Image of the inverse factor Z. Upper panels: Alkane chain with 386 atoms.Lower panels: Alkane chain with 1538 atoms.

Fig. 8 :
Fig. 8: Left and center panels: Magnitude of matrix entries as a function of distance between vertices and as a function of distance from cut for K 0 .Right panel: Image of the correction matrix K 0 .Upper panels: Alkane chain with 386 atoms.Lower panels: Alkane chain with 1538 atoms.

Table 1 :
Parameters used to set up the lattice benchmark systems.