## Abstract

For more than two decades, the size of seismic tomography problems, in terms of both data and model parameters, has exceeded the computational resources available for direct methods, leaving iterative solvers as the only option. One disadvantage of the iterative techniques is that the inverse of the matrix that defines the system is not explicitly formed, and as a consequence, the model resolution and covariance matrices cannot be computed. Despite the significant effort in finding computationally affordable approximations of these matrices, challenges remain, and methods such as the checkerboard resolution tests continue to be used. Based upon recent developments in sparse algorithms and high-performance computing resources, we show that direct methods are becoming feasible for large seismic tomography problems. We demonstrate the application of QR factorization in solving the regional *P*-wave structure and computing the full resolution matrix with 267 520 model parameters.

## INTRODUCTION

Seismic tomography is the most powerful tool for imaging the interior of the Earth, and many geophysical problems, including seismic tomography, are typically reduced to solving a linear system of equations (e.g. Aster *et al.* 2005; Nolet 2008; Rawlinson *et al.* 2010). It is desirable to use accurate and robust direct methods or direct solvers (e.g. Demmel 1997) to obtain solutions. However, when the size of these inverse problems is large, the application of direct solvers becomes computationally difficult. Over the last decade, advances in the theory and methodology of forward and inverse problems, as well as increases in computational power and in the number of available data, enabled the exploration of massive tomographic problems. For example, data availability has been accelerated with the introduction of dense networks such as the EarthScope USArray project (e.g. Meltzer *et al.* 1999), the High Sensitivity Seismograph Network in Japan (Okada *et al.* 2004) and the WOMBAT Seismic Array in Australia (Rawlinson *et al.* 2008). These data allow finer spatial resolution of the tomographic models, yielding inverse problems with *O*(10^{5})–*O*(10^{7}) observations and *O*(10^{5})–*O*(10^{6}) unknown parameters (e.g. Nolet 2008; Rawlinson *et al.* 2010, 2014).

As an alternative, iterative solvers such as the LSQR method (Paige & Saunders 1982a,b) are used to avoid prohibitive memory requirements or computational run times. A disadvantage of the iterative techniques is that the inverse of the matrix that defines the system is not explicitly formed, hence measures of the robustness and quality of the solution, such as the model resolution and covariance matrices, cannot be obtained. Consequently, a paradoxical situation arises: although modern tomographic studies with large numbers of data and model parameters produce potentially better images in terms of spatial resolution, the robustness and uncertainty of these models become obscured. In other words, the improvements cannot be accurately assessed or confirmed.
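To make the contrast concrete, the sketch below solves a small, consistent sparse system with SciPy's `lsqr`; the random matrix is a made-up stand-in for a real sensitivity matrix, and the point is that the solver only ever touches **G** through matrix-vector products, so no inverse operator (and hence no resolution or covariance matrix) is ever formed:

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import lsqr

rng = np.random.default_rng(0)

# A small sparse, overdetermined system standing in for the tomographic G x = d.
G = sparse_random(200, 50, density=0.05, random_state=0, format="csr")
x_true = rng.standard_normal(50)
d = G @ x_true

# LSQR works only through products with G and G^T; the inverse operator G^-
# is never formed, which is why resolution and covariance are unavailable.
x_hat = lsqr(G, d, atol=1e-12, btol=1e-12)[0]
```

Because the system above is consistent, the recovered residual is essentially zero, even though no matrix is ever inverted.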

Different solutions have been proposed to overcome this challenge. Attempts have been made to assess the spatial resolution of the model by means of synthetic reconstruction, using checkerboard models or the results of the inversion as inputs (e.g. Rawlinson *et al.* 2014). However, such tests can be misleading as they consider only particular linear combinations of the columns of the resolution matrix (Lévêque *et al.* 1993; Rawlinson *et al.* 2014). Alternatively, methods based on LSQR can produce an effective resolution matrix (Zhang & McMechan 1995, 1996; Yao *et al.* 1999; Berryman 2000a,b; Zhang & Thurber 2007), but this matrix may differ significantly from the true resolution matrix (Deal & Nolet 1996; Berryman 2000a,b; Zhang & Thurber 2007). Nolet *et al.* (1999) present another option, a one-step back-projection method based upon the Moore-Penrose conditions (Jackson 1972), to obtain an explicit expression for an approximate inverse matrix from which the full resolution matrix can be computed. For this approach, the accuracy of the resulting resolution matrix depends upon certain assumptions about the sensitivity matrix (Nolet *et al.* 1999; Yao *et al.* 2001). On the other hand, Soldati *et al.* (2006) use parallel Cholesky factorization of the normal equations on shared-memory systems, explicitly forming the product of the matrix and its transpose. One issue with this product is its high condition number (the square of the condition number of the initial matrix, which is itself ill-conditioned), implying that rounding errors can cause loss of information.

Recently, stochastic matrix probing techniques (Hutchinson 1990; Bekas *et al.* 2007) have been introduced to estimate the main diagonal elements of the resolution matrix (MacCarthy *et al.* 2011; An 2012; Trampert & Fichtner 2013). These techniques provide an accurate assessment of the diagonal elements, but they do not calculate the off-diagonal elements that are useful in assessing the coupling between different model parameters, that is, in identifying artefacts such as leakage between different parts of the model space. Furthermore, depending upon the structure of the resolution matrix, these methods require a large number of iterations to yield satisfactory approximations (Bekas *et al.* 2007). Another approach, available for adjoint tomography applications, is to utilize second-order adjoints to extract resolution information (Fichtner & Trampert 2011). This approach can be considered a generalization of the ray density tensor (Kissling 1988), which quantifies the space-dependent azimuthal coverage and therefore serves only as a proxy for resolution. It is more efficient than other techniques, but is still costly, and the output is a direction-dependent resolution length, not the full resolution matrix with trade-offs between model parameters.
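The probing idea can be sketched in a few lines. Everything here is illustrative: the matrix is a synthetic stand-in for a resolution operator with a strong diagonal, and the probe count is arbitrary. With Rademacher probe vectors **v**, the expectation of the elementwise product of **v** with the matrix-vector product equals the diagonal:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
# A stand-in resolution matrix: strong diagonal, weak off-diagonal trade-offs.
R = 0.01 * rng.standard_normal((n, n)) + np.diag(rng.uniform(0.5, 1.0, n))

# Hutchinson-style probing: for Rademacher vectors v, E[v * (R v)] = diag(R),
# so averaging v * (R v) over many probes estimates the diagonal using only
# matrix-vector products, never the explicit matrix inverse.
num_probes = 2000
acc = np.zeros(n)
for _ in range(num_probes):
    v = rng.choice([-1.0, 1.0], size=n)
    acc += v * (R @ v)
diag_est = acc / num_probes
```

The estimator's variance per entry scales with the squared off-diagonal mass divided by the probe count, which is why matrices with significant off-diagonal structure need many probes.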

In this work, we revisit the application of direct methods to tomographic problems in view of the significant developments in computer science and hardware over the last decade. For example, efficient rank-revealing algorithms based upon QR factorization that take advantage of modern architectural features such as memory hierarchies and parallelism now exist (e.g. Davis 2011). Furthermore, new fill-reducing ordering algorithms effectively preserve sparsity throughout common factorizations (e.g. Karypis & Kumar 1998; Amestoy *et al.* 2004; Davis *et al.* 2004a,b). These advances are beginning to make direct methods feasible for problems with large matrices, such as seismic tomography. We demonstrate that they can be used to solve a teleseismic tomography problem and to compute the full resolution matrix. We also show that error propagation and the covariance matrix can be calculated at minimal additional cost.

## METHOD

Seismic tomography is a non-linear and high-dimensional optimization problem in which a set of model parameters is sought that satisfies the observations and additional *a priori* constraints. In many cases, the problem is locally approximated in the vicinity of an optimal Earth model, and perturbation-based methods are used to acquire successive linear updates to the starting model until a convergence criterion is met (e.g. Rawlinson *et al.* 2014). The linearized problem in each iteration is constructed using the forward operator **G** such that

$$\mathbf{G} \cdot \mathbf{x} = \mathbf{d}, \quad (1)$$

where **x** is a model vector with *m* unknown parameters, **d** is the data vector containing *n* observations or the residuals between observations and synthetics, and **G** is an *n* × *m* matrix that, in most seismic tomography approaches, is sparse. A solution $$\hat{\mathbf{x}}$$ can be obtained by applying a linear matrix inverse operator, **G**^{−}, that is,

$$\hat{\mathbf{x}} = \mathbf{G}^{-} \cdot \mathbf{d}. \quad (2)$$

The true model parameters **x** and the estimated parameters $$\hat{\mathbf{x}}$$ are related through the model resolution matrix $$\boldsymbol{\mathcal{R}}$$ as (Jackson 1972; Menke 1989)

$$\hat{\mathbf{x}} = \mathbf{G}^{-} \cdot \mathbf{G} \cdot \mathbf{x} = \boldsymbol{\mathcal{R}} \cdot \mathbf{x}. \quad (3)$$

Eq. (3) suggests that the retrieved solution can be considered a projection of the true model. The diagonal elements of $$\boldsymbol{\mathcal{R}}$$ describe the independent resolvability of each parameter, and the off-diagonal entries show the trade-offs between model parameters. In the ideal case, $$\boldsymbol{\mathcal{R}} = \mathbf{I}$$ and $$\hat{\mathbf{x}} = \mathbf{x}$$, but in real tomographic applications, the resolution matrix deviates from the identity matrix with non-zero off-diagonal elements. The exact form of $$\boldsymbol{\mathcal{R}}$$ is related to the geometry of the tomographic problem (e.g. locations of sources and stations), the theory and assumptions used for the forward operator (e.g. rays, sensitivity kernels), and the number and type of observations (e.g. arrival time measurements).
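A toy illustration of the resolution-matrix relation, using a damped least-squares inverse as the operator **G**^{−} (the damping weight and matrix sizes are arbitrary choices for the sketch): with damping, $$\boldsymbol{\mathcal{R}}$$ deviates from the identity and every diagonal element falls below one.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 80, 30
G = rng.standard_normal((n, m))
a = 2.0  # illustrative damping weight

# Damped least-squares inverse operator G^- and the resolution matrix
# Res = G^- . G; with damping, Res deviates from the identity.
G_inv = np.linalg.solve(G.T @ G + a**2 * np.eye(m), G.T)
Res = G_inv @ G

x = rng.standard_normal(m)   # 'true' model
x_hat = G_inv @ (G @ x)      # estimate from noise-free data
```

Even with perfect, noise-free data, the estimate is only the projection of the true model through the resolution matrix.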

## COMPUTATIONAL ASPECT

Commonly, **G**^{−} is chosen such that $$\hat{\mathbf{x}}$$ is a least-squares solution to eq. (1), that is, it is obtained by minimizing $$\| \mathbf{G} \cdot \hat{\mathbf{x}} - \mathbf{d} \|^{2}$$, where ‖ · ‖ is the Euclidean norm (e.g. Menke 1989; Aster *et al.* 2005; Nolet 2008). This yields the normal equations

$$\mathbf{G}^{\mathrm{T}} \cdot \mathbf{G} \cdot \hat{\mathbf{x}} = \mathbf{G}^{\mathrm{T}} \cdot \mathbf{d}. \quad (4)$$

In most seismic tomography problems, the forward operator **G** is rank deficient or nearly rank deficient even when *n* > *m*, since some rows are either exactly or nearly linearly dependent in the presence of round-off errors and modelling uncertainties. Consequently, additional conditions are imposed, such as some form of Tikhonov regularization operator **L** of size *k* × *m* with *n* + *k* ≥ *m* (Tikhonov & Arsenin 1977). The regularized least-squares problem is the same as eq. (1) except that **G** is replaced by its regularized version **A** and the vector **d** by **b**, where

$$\mathbf{A} = \begin{bmatrix} \mathbf{G} \\ a \mathbf{L} \end{bmatrix}, \quad (5)$$

$$\mathbf{b} = \begin{bmatrix} \mathbf{d} \\ \mathbf{0} \end{bmatrix}, \quad (6)$$

and *a* is a scalar that controls the weight of the regularization term. The corresponding regularized normal equations are

$$( \mathbf{G}^{\mathrm{T}} \cdot \mathbf{G} + a^{2} \mathbf{L}^{\mathrm{T}} \cdot \mathbf{L} ) \cdot \hat{\mathbf{x}} = \mathbf{G}^{\mathrm{T}} \cdot \mathbf{d}. \quad (7)$$

A potential problem with the normal-equations approach is that it yields less accurate solutions than direct methods applied to the original system when the initial problem is ill-conditioned, because the accuracy depends on the condition number of **G**^{T} · **G**, which is the square of the condition number of **G**. Furthermore, although **G** is typically a sparse matrix, **G**^{T} · **G** is significantly denser and requires additional memory for storage. The most accurate approach for solving either eq. (1) or its regularized version is through singular value decomposition (SVD; Lanczos 1961; Demmel 1997; Snieder & Trampert 1999; Aster *et al.* 2005) or generalized SVD (e.g. Hansen 1990, 1992, 1998, 2007; Aster *et al.* 2005). However, as data and model spaces become larger, the computational cost of SVD in floating-point operations, and in particular its memory requirements, makes it intractable. This arises because the sparseness of the numerical null space of **G** is not exploited.
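The squaring of the condition number is easy to demonstrate numerically. The sketch below builds a matrix with prescribed singular values (all sizes and values are illustrative), then compares the conditioning of the matrix and of its normal-equations product:

```python
import numpy as np

rng = np.random.default_rng(3)
# Build an ill-conditioned G with prescribed singular values.
U, _ = np.linalg.qr(rng.standard_normal((50, 8)))
V, _ = np.linalg.qr(rng.standard_normal((8, 8)))
s = np.logspace(0, -6, 8)          # cond(G) = 1e6 by construction
G = U @ np.diag(s) @ V.T

# Forming the normal-equations matrix squares the condition number, so up to
# half of the significant digits can be lost before the solve even starts.
cond_G = np.linalg.cond(G)
cond_GtG = np.linalg.cond(G.T @ G)
```

With double precision carrying roughly 16 significant digits, a condition number of 10^{6} in **G** becomes 10^{12} in **G**^{T} · **G**, leaving only a few accurate digits in the solution.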

## SPARSE QR FACTORIZATION

In this study, we select QR factorization since it avoids the explicit formation of **G**^{T} · **G**. Although modern QR algorithms can be used to solve rank-deficient problems (Davis 2011; Foster & Davis 2013), we consider the regularized least-squares problem to be consistent with conventional seismic tomography modelling such as that presented in MacCarthy *et al.* (2011). The forward operator **A** is therefore a full-rank, rectangular matrix with more rows than columns, that is, *n* + *k* > *m*. The QR factorization decomposes **A** into a product $${{\bf A}} = {{\bf \tilde Q}} \cdot {{\bf \tilde R}}$$ consisting of an orthogonal, square matrix $${{\bf \tilde Q}}$$ and a trapezoidal matrix $${{\bf \tilde R}}$$. The trapezoidal matrix $${{\bf \tilde R}}$$ has the form $${{\bf \tilde R}} = {[ {{{{\bf R}}^{\rm T}}\,{\bf 0^{\rm T}}} ]^{\rm T}}$$, where **R** is an *m* × *m* upper triangular matrix. The square orthogonal matrix $${{\bf \tilde Q}} = [ {{{\bf Q}}\,\ {{{\bf Q}}_0}} ]$$ consists of the 'economy' QR factor **Q** of size (*n* + *k*) × *m* that corresponds to **R**, and **Q**_{0} that contains the remaining *n* + *k* − *m* columns of $${{\bf \tilde Q}}$$. QR factorization solves the least-squares problem through minimization of an equivalent problem (e.g. Björck 1996; Gander *et al.* 2014), that is,

$$\min_{\mathbf{x}} \| \mathbf{A} \cdot \mathbf{x} - \mathbf{b} \|^{2} = \min_{\mathbf{x}} \left( \| \mathbf{R} \cdot \mathbf{x} - \mathbf{Q}^{\mathrm{T}} \cdot \mathbf{b} \|^{2} + \| \mathbf{Q}_{0}^{\mathrm{T}} \cdot \mathbf{b} \|^{2} \right). \quad (8)$$

Since **R** is full rank, the least-squares solution of eq. (8) is the same as the solution to the linear system $$\mathbf{R} \cdot \mathbf{x} = \mathbf{Q}^{\mathrm{T}} \cdot \mathbf{b}$$, and the minimum value of the objective function is $$\| \mathbf{Q}_{0}^{\mathrm{T}} \cdot \mathbf{b} \|^{2}$$, where $$\| \mathbf{Q}_{0}^{\mathrm{T}} \cdot \mathbf{b} \| = ( \| \mathbf{b} \|^{2} - \| \mathbf{Q}^{\mathrm{T}} \cdot \mathbf{b} \|^{2} )^{1/2}$$. Similar to SVD, the 2-norm is preserved in QR factorization since $${{\bf \tilde Q}}$$ is orthogonal. Norm preservation ensures that rounding errors and inaccuracies in **A** or **G** are not amplified (e.g. Demmel 1997).
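A minimal dense sketch of this procedure, using NumPy's economy QR as a stand-in for the sparse factorization used in the study, including a check that the minimum residual equals the norm identity above:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 60, 20
A = rng.standard_normal((n, m))
b = rng.standard_normal(n)

# Economy QR: A = Q R with orthonormal Q (n x m), upper-triangular R (m x m).
Q, R = np.linalg.qr(A)

# Least-squares solution from the triangular system R x = Q^T b.
x = np.linalg.solve(R, Q.T @ b)

# The minimum residual equals ||Q0^T b|| = (||b||^2 - ||Q^T b||^2)^(1/2),
# computed here without ever forming Q0.
residual = np.linalg.norm(A @ x - b)
predicted = np.sqrt(np.linalg.norm(b) ** 2 - np.linalg.norm(Q.T @ b) ** 2)
```

Note that only **Q**^{T} · **b** and the triangular **R** enter the solve, which is exactly the property the sparse implementation exploits to avoid storing **Q**.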

Various direct solvers exist that take advantage of the sparsity of the **A** matrix (e.g. Davis 2006), but they are considerably more complicated and difficult to implement than their dense counterparts. Among the complications is the need to handle efficiently the fill-in (i.e. the new non-zero entries created in the factors during the factorization) and to ensure that sparseness propagates throughout the procedure. For example, different permutations of the rows and columns of the input matrix can produce **R** factors with vastly different sparsity. The example shown in Fig. 1 considers a matrix formed from an upper rectangular block and a lower diagonal matrix, a typical structure for a tomographic problem with zeroth-order Tikhonov regularization. Only 0.9 per cent of the entries of this matrix are non-zero, but QR factorization yields an upper triangular **R** factor in which 46 per cent of the entries are non-zero. A different ordering of the columns of the initial matrix, on the other hand, produces a significantly sparser **R** factor with only 9.4 per cent of its entries being non-zero (Fig. 1).
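The effect of column ordering on fill-in can be reproduced with a toy 'arrow' matrix, one dense column plus unit columns, which caricatures the dense-block-over-diagonal structure discussed above. Eliminating the dense column first fills the whole upper triangle of **R**, while eliminating it last keeps **R** nearly as sparse as the input:

```python
import numpy as np

m = 60
# An 'arrow' matrix: one dense column plus unit columns.
dense_col = np.ones((m, 1))
unit_cols = np.eye(m)[:, 1:]

A_bad = np.hstack([dense_col, unit_cols])    # dense column eliminated first
A_good = np.hstack([unit_cols, dense_col])   # dense column eliminated last

def nnz_of_R(A, tol=1e-12):
    # Count entries of the triangular factor above a small tolerance.
    R = np.linalg.qr(A, mode="r")
    return int(np.count_nonzero(np.abs(R) > tol))
```

In this sketch the good ordering leaves roughly 2*m* non-zeros in **R**, whereas the bad ordering fills essentially the entire upper triangle; fill-reducing ordering packages such as METIS or COLAMD automate this choice for general sparsity patterns.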

In this study, we use the SPQR algorithm from the SuiteSparseQR package (Davis 2011), a high-performance, parallel algorithm based on the multifrontal method (Duff & Reid 1983). The factorization of large sparse matrices is broken into multiple factorizations of smaller dense submatrices, and the structure of the algorithm has a dendritic organization, suitable for parallel computing. Each node in the tree is assigned to the factorization of a dense submatrix, a frontal matrix. The results of lower order nodes are gathered to assemble frontal matrices of higher order nodes through irregular, one-way data communication that occurs only from the lower to higher order nodes (Davis 2011).

The SPQR algorithm includes the ability to represent and store the orthogonal matrix **Q** in an efficient sparse format using Householder transformations (Householder 1958; Davis 2011). Nevertheless, a more efficient alternative, available as an option in the SuiteSparseQR package, is to compute the product **Q**^{T} · **b** directly during the factorization so that **Q** need not be stored. We can obtain both the solution vector and the resolution matrix $$\boldsymbol{\mathcal{R}}$$ of the regularized problem by using

$$\mathbf{R} \cdot [\, \hat{\mathbf{x}} \;\; \boldsymbol{\mathcal{R}} \,] = \mathbf{Q}^{\mathrm{T}} \cdot \mathbf{B}, \qquad \mathbf{B} = [\, \mathbf{b} \;\; \tilde{\mathbf{G}} \,], \qquad \tilde{\mathbf{G}} = \begin{bmatrix} \mathbf{G} \\ \mathbf{0} \end{bmatrix}. \quad (9)$$

This formulation has the advantage that both the analysis and the factorization are performed only once. The QR factors are used for retrieving the model estimation that corresponds to the first column of **B**, and the resolution matrix is obtained, in a column-wise manner, from the remaining columns of **B**.
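A small dense sketch of this one-factorization strategy (illustrative sizes; NumPy's QR stands in for SPQR): the first column of **B** yields the model estimate and the remaining columns yield the resolution matrix, all from a single triangular solve.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, a = 40, 15, 0.5
G = rng.standard_normal((n, m))
L = np.eye(m)                       # zeroth-order Tikhonov (pure damping)
d = rng.standard_normal(n)

# Regularized system: A = [G; aL], b = [d; 0].
A = np.vstack([G, a * L])
b = np.concatenate([d, np.zeros(m)])

# One QR factorization serves both purposes: solve R [x_hat | Res] = Q^T B,
# where B holds b in its first column and the zero-padded G in the rest.
Q, R = np.linalg.qr(A)
G_padded = np.vstack([G, np.zeros((m, m))])
B = np.hstack([b[:, None], G_padded])
X = np.linalg.solve(R, Q.T @ B)
x_hat, Res = X[:, 0], X[:, 1:]
```

The triangular solve against multiple right-hand sides is cheap compared with the factorization itself, which is why the analysis and factorization need to run only once.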

Following this approach, error propagation can also be performed at minor additional cost. In real applications, the data vector **d** is contaminated with errors due to noise and other uncertainties, and eq. (1) is written as

$$\mathbf{d} = \mathbf{G} \cdot \mathbf{x} + \mathbf{e}, \quad (10)$$

where **e** is the vector containing the errors. The error propagates to the model estimation as

$$\hat{\mathbf{x}} = \mathbf{G}^{-} \cdot \mathbf{d} = \boldsymbol{\mathcal{R}} \cdot \mathbf{x} + \mathbf{G}^{-} \cdot \mathbf{e}. \quad (11)$$

The overall departure of the true model **x** from the estimated model $$\hat{\mathbf{x}}$$ is the result of both limited resolution and error propagation (e.g. Snieder & Trampert 1999), that is,

$$\mathbf{x} - \hat{\mathbf{x}} = ( \mathbf{I} - \boldsymbol{\mathcal{R}} ) \cdot \mathbf{x} - \mathbf{G}^{-} \cdot \mathbf{e}. \quad (12)$$

The error propagation can be calculated at minimum cost by incorporating the vector **e** as an additional column of **B** (eq. 9). In addition to the error propagation, the model covariance matrix, defined as *C* = **A**^{−} · (**A**^{−})^{T}, provides a measure of model uncertainty (e.g. Nolet *et al.* 1999). In QR factorization, due to orthogonality, *C* can be rewritten as *C* = **R**^{−} · (**R**^{−})^{T} = (**R**^{T} · **R**)^{−} (e.g. Björck 1996; Gander *et al.* 2014), and owing to its symmetry, only the upper triangular part needs to be computed. Several algorithms exist that efficiently compute *C* or certain elements of it, such as the diagonal (e.g. Björck 1996; Gander *et al.* 2014).
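The covariance shortcut can be sketched directly from the triangular factor (toy sizes; dense NumPy/SciPy calls standing in for their sparse counterparts). Only **R** is needed, and the diagonal of *C* follows from the row norms of **R**^{−1}:

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(6)
A = rng.standard_normal((50, 12))

# Only the triangular factor R is needed for the model covariance
# C = (R^T R)^-1 = R^-1 R^-T; diag(C) follows from row norms of R^-1.
R = np.linalg.qr(A, mode="r")
R_inv = solve_triangular(R, np.eye(12))
C = R_inv @ R_inv.T
diag_C = np.sum(R_inv ** 2, axis=1)
```

Because **R** is triangular, its inverse (or selected entries of *C*) can be obtained by cheap back-substitutions rather than a general matrix inversion.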

## APPLICATION

To demonstrate the features and applicability of the sparse QR approach, we use the teleseismic data set from the Colorado Rockies Experiment and Seismic Transects (CREST; Aster *et al.* 2009; MacCarthy 2010) and obtain the *P*-wave structure beneath Colorado. This data set has been used to estimate the diagonal elements of the resolution matrix using matrix probing techniques (MacCarthy *et al.* 2011) and therefore provides a good basis for comparison. The model space is parametrized in the same manner as in MacCarthy *et al.* (2011) using 267 520 constant-velocity blocks of 0.25° × 0.25° × 25 km in size. We further follow MacCarthy *et al.* (2011) and regularize the least-squares problem using Tikhonov regularization (Tikhonov & Arsenin 1977). The data vector **d** consists of 19 608 mean-removed teleseismic *P*-wave traveltime residuals from 167 stations (Fig. 2), and the forward operator **G** has dimensions of 19 608 × 267 520. The regularization kernel **L**_{k × m}, with *k* = 570 232, is a combination of damping and Laplacian smoothing in equal amounts, both scaled by the Lagrangian multiplier *a*, which is set to 0.5 (MacCarthy 2010; MacCarthy *et al.* 2011). The sizes of the matrices are large, and using conventional seismic tomography techniques and algorithms, this problem is considered infeasible to solve directly.

Our setup is based on SuiteSparse 4.4.0 (Davis 2009, 2011; available online at http://faculty.cse.tamu.edu/davis/suitesparse.html) with the MATLAB interface compiled with METIS 4.0.1 (Karypis & Kumar 1998) and Intel's Threading Building Blocks 4.3 (Reinders 2007). Various algorithms in the SuiteSparse package are tested for the best permutation of the input matrix for sparseness preservation, and we find that the METIS algorithm (Karypis & Kumar 1998) yields the best result for the QR factors. All computations are performed on one node of Harvard's Odyssey Cluster, using 32 CPU cores and a total shared memory of about 250 GB. The parameters for the SPQR algorithm are selected such that the 'nthreads' parameter is set to the number of cores and the 'grain' parameter is set to twice the number of cores. Both parameters control the multitasking performance of Intel's Threading Building Blocks, and their values comply with the recommended configuration of the algorithm (Davis 2009). The tolerance for treating a column as zero is set to a 2-norm less than or equal to 0.1. The matrix **A** is then successfully factorized, and the factors **R** and **Q** are stored, the latter in the economical form of Householder reflections. The resulting **R** factor is sparse, with only 2.2 per cent of its elements being non-zero (Fig. 3a).

The model vector **x** and the full resolution matrix $$\boldsymbol{\mathcal{R}}$$ should be attainable from these factors, or more efficiently, by using the SPQR_SOLVE algorithm (Davis 2011), in which **Q** is not stored at all (eq. 9). If the resolution matrix is sparse enough, this procedure can be carried out with a single call of the SPQR_SOLVE algorithm. However, the resolution matrix in this case is not sufficiently sparse and exceeds the available memory. To circumvent this problem, the **B** matrix is broken into 10 equal columnar strips; if a strip fails to produce results due to lack of memory, it is further divided in half. The model parameters and the full resolution matrix (Fig. 3b) are obtained using this approach. The resolution matrix, with 267 520 × 267 520 elements, contains ∼40 per cent non-zero entries and requires over 200 GB of storage space. Reduction in storage can be achieved by eliminating entries with absolute values smaller than 10^{−3}. These elements are practically zero, and the application of the threshold leads to significant gains in disk space, decreasing the 'non-zero' entries to ∼0.5 per cent and the required storage to about 3 GB. This thresholded version of $$\boldsymbol{\mathcal{R}}$$ is used for the remainder of this manuscript.
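The storage gain from thresholding can be imitated on a synthetic block (the sizes, the 10^{−3} threshold, and the noise level here are all illustrative, not the actual resolution matrix):

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(7)
# A dense block standing in for part of a resolution matrix: ones on the
# diagonal, tiny numerical noise everywhere else.
block = 1e-4 * rng.standard_normal((500, 500))
np.fill_diagonal(block, 1.0)

# Drop entries below the 1e-3 threshold before storing in sparse form.
block[np.abs(block) < 1e-3] = 0.0
sparse_block = csr_matrix(block)
density = sparse_block.nnz / block.size
```

After thresholding, only the significant entries remain and the sparse representation stores a small fraction of the original elements, mirroring the drop from ∼40 to ∼0.5 per cent observed for the full matrix.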

Both the estimated model and the diagonal elements of the resolution matrix are compatible with the results of MacCarthy *et al.* (2011). The relative difference $$\varepsilon_{\boldsymbol x}$$ between the model estimated following the MacCarthy *et al.* (2011) approach that utilizes the LSQR algorithm, $${\boldsymbol x}_{{\rm{LSQR}}}$$, and that of the QR approach described here, $${\boldsymbol x}_{{\rm{QR}}}$$, calculated as $$\varepsilon_{\boldsymbol x} = \|{\boldsymbol x}_{{\rm{LSQR}}} - {\boldsymbol x}_{{\rm{QR}}}\|/\|{\boldsymbol x}_{{\rm{LSQR}}}\|$$, is 0.03 per cent (Fig. 4; Supporting Information Fig. S1). Furthermore, the relative difference ε_{diag} of the diagonal elements of the resolution matrix between the stochastic approximation of MacCarthy *et al.* (2011), $${\boldsymbol {\mathcal R}}_{{\rm{sto}}}^{{\rm{diag}}}$$, and the QR approach, $${\boldsymbol {\mathcal R}}_{{\rm{QR}}}^{{\rm{diag}}}$$, calculated as $${\varepsilon _{{\rm{diag}}}} = \|{\boldsymbol {\mathcal R}}_{{\rm{sto}}}^{{\rm{diag}}} - {\boldsymbol {\mathcal R}}_{{\rm{QR}}}^{{\rm{diag}}}\|/\|{\boldsymbol {\mathcal R}}_{{\rm{sto}}}^{{\rm{diag}}}\|$$, is 5.40 per cent. This result does not change, up to the second decimal place, even if we apply the same 10^{−3} threshold to $${\boldsymbol {\mathcal R}}_{{\rm{sto}}}^{{\rm{diag}}}$$. The relative difference between $${\boldsymbol {\mathcal R}}_{{\rm{sto}}}^{{\rm{diag}}}$$ and its thresholded version is 0.27 per cent, confirming that small entries, even on the diagonal, do not contribute significantly to the structure of the resolution matrix.

The availability of the full resolution matrix allows the evaluation of the trade-offs between model parameters, which can be visualized by plotting specific rows of the resolution matrix (Fig. 5). The example in Figs 5(a) and (b) illustrates that trade-offs extend over large distances, well beyond the region in which the diagonal element of the resolution matrix is relevant. Even if the absolute values of individual off-diagonal elements are relatively small, their cumulative effect over a large number of model parameters can be significant. This implies that uncertainty estimates based on the diagonal elements alone can be inaccurate, especially where the diagonal elements of the resolution matrix are small.

## CONCLUSIONS

Advances in computational science and hardware infrastructure are at a stage where large tomographic problems can be solved efficiently using direct methods. In addition, most seismic tomography matrices are sparse, and this feature can be exploited to facilitate the analysis. Depending upon the algorithm and the implementation, users can adjust some parameters, such as the column 2-norm tolerance that determines when a column is treated as zero, but many others are hard-wired. For example, the 'zero threshold' is typically defined as the tolerance below which an element is considered zero and dropped during numerical operations. Software (including Sparse QR) and hardware that comply with the IEEE-754 floating-point format use the smallest denormalized number as the zero threshold, which is 4.9407e-324 for double-precision calculations. This means that any element of a matrix above this value is considered non-zero, and therefore stored in memory, even if it becomes as small as 1e-323. This threshold is unrealistically small for many numerical applications, including seismic tomography, and results in pseudo-dense matrices, for example, during the computation of the resolution matrix.

Introducing modifications to existing solvers based upon the particular needs and properties of seismic tomography problems, such as an option to change the 'zero threshold', can significantly increase the computational efficiency. For example, as mentioned previously, the need to break **B** into smaller strips to fit into memory, and therefore to repeat the factorization procedure, arises because the product of **Q**^{T} with **B** is dense, or pseudo-dense. Changing the 'zero threshold' parameter would allow small values to be dropped as they are computed, instead of storing the dense result and removing entries with the 10^{−3} threshold afterwards. This would accelerate the procedure dramatically, since it would allow the computation of both the inversion solution and the full resolution matrix with only one factorization, requiring a total computation time only slightly longer than one factorization (∼12 hr). Furthermore, it would significantly reduce the total required memory, allowing the computations to be carried out on a desktop computer.

One of the main advantages of direct sparse methods is the assessment of the model resolution through accurate calculation of the full resolution matrix, including both diagonal and off-diagonal elements. The large number of parameters (i.e. columns), however, poses challenges in efficient examination and exploration of this matrix. One possible future direction is using algorithms similar to those applied to large networks and graphs (e.g. Batagelj & Mrvar 2003; Bullmore & Sporns 2009; Cohen & Havlin 2010; Davis & Hu 2011; Kenett & Havlin 2015). The resolution matrix can be considered as a graph where each element $${{\boldsymbol {\mathcal R}}_{ij}}$$ describes the trade-off relationship between the parameters *i* and *j*.
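As a toy sketch of this graph view (the matrix below is invented, with two decoupled parameter groups), standard graph routines such as connected-components labelling apply directly to a thresholded resolution matrix:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# A toy thresholded resolution matrix with two decoupled parameter groups:
# a non-zero R_ij is read as an edge, i.e. a trade-off between i and j.
R = np.zeros((6, 6))
R[:3, :3] = 0.2
R[3:, 3:] = 0.3
np.fill_diagonal(R, 1.0)

n_groups, labels = connected_components(csr_matrix(R), directed=False)
```

In this reading, connected components identify groups of mutually coupled parameters, and other graph measures (degree, centrality, community structure) could summarize the trade-off structure without inspecting all 267 520 rows individually.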

## ACKNOWLEDGEMENTS

The authors would like to thank the editor, Ingo Grevemeyer, and two reviewers, Adrià Meléndez and Jonathan MacCarthy, for their constructive comments and suggestions that helped to improve the manuscript.

## REFERENCES


## SUPPORTING INFORMATION

Additional Supporting Information may be found in the online version of this paper:

**Figure S1.** A comparison of the *P*-wave perturbation models at 20 km (top row) and 400 km (bottom row) depths obtained by using the approaches presented in the manuscript (a,c), and LSQR algorithm of MacCarthy *et al.* (2011) (b,d). There are no systematic differences between the two models regardless of the poor- or well-resolved regions of the model space.

(http://gji.oxfordjournals.org/lookup/suppl/doi:10.1093/gji/ggw052/-/DC1).

Please note: Oxford University Press is not responsible for the content or functionality of any supporting materials supplied by the authors. Any queries (other than missing material) should be directed to the corresponding author for the paper.