Square-root variable metric based elastic full-waveform inversion—Part 2: uncertainty estimation

S U M M A R Y
In our first paper (Part 1) on the square-root variable metric (SRVM) method, we presented the basic theory and validation of the inverse algorithm applicable to large-scale seismic data inversions. In this second paper (Part 2), the objective is to estimate the resolution and uncertainty of the resulting inverted geophysical model. Bayesian inference allows estimating the posterior model distribution from its prior distribution and the likelihood function. For a linear problem with Gaussian distributions, these distributions can be characterized mathematically by their covariance matrices. However, it is prohibitive to explicitly construct and store such covariance matrices in large-scale practical problems. In Part 1, we applied the SRVM method to elastic full-waveform inversion (FWI) in a matrix-free vector version. This new algorithm gives access to the posterior covariance by reconstructing the inverse Hessian from memory-affordable vector series. The focus of this paper is on extracting quantitative and statistical information from the inverse Hessian for quality assessment of the seismic model inverted by FWI. To operate on the inverse Hessian more efficiently, we compute its eigenvalues and eigenvectors with randomized singular value decomposition. Furthermore, we collect point-spread functions from the Hessian in an efficient way. The posterior standard deviation quantitatively measures the uncertainties of the posterior model. 2-D Gaussian random samplers help to visually compare the prior and posterior distributions. We demonstrate our method on several numerical examples, providing an uncertainty estimation analysis applicable to large-scale inversions.


I N T R O D U C T I O N
Full-waveform inversion (FWI) addresses the geophysical inverse problem of estimating subsurface model parameters from observed waveform data. In most geophysical applications, FWI is cast as an iterative, local optimization problem that attempts to minimize the least-squares residuals between observed and synthetic data. Mathematically, the inverse problem is ill-posed, leading to non-uniqueness of the solutions. It remains challenging to solve inverse problems practically due to limitations in data acquisition, measurement uncertainties and the non-uniqueness of the solution. Several methods in geophysics have been proposed to attack this FWI challenge. We can divide them mainly into two categories: deterministic methods such as gradient optimization-based ones (see, e.g. Lailly 1983; Tarantola 1984, 1987; Pratt 1999), and probabilistic approaches such as Bayesian inversions (see, e.g. Gouveia & Scales 1998; Käufl et al. 2013; Biswas & Sen 2017).
Deterministic FWI methods are well developed and tackle several difficult issues, such as convergence rate, cycle skipping and multiparameter trade-offs (Bamberger et al. 1979; Tarantola 1987; Crase et al. 1990; Bunks et al. 1995; Igel et al. 1996; Tromp et al. 2005; Warner & Guasch 2016). Probabilistic FWI approaches are in principle the preferred solutions to an inverse problem, as they additionally provide posterior distributions for assessing statistical information about the solutions, but they often become prohibitive due to computational costs. From the viewpoint of Bayesian inference, the solution of the inverse problem should not be limited to a single set of inverted values, but should be represented by a probability density function (PDF) to quantify the posterior model uncertainty (Tarantola 2005).
Estimation of the resolution or uncertainty in seismic inversions has a long history in geophysics and can be analysed with mathematical tools such as the posterior covariance matrix (Tarantola & Valette 1982). The posterior covariance matrix is closely related to the Hessian matrix (Fichtner & Trampert 2011a,b; Fichtner & van Leeuwen 2015; Zhu et al. 2016). However, for practical problems with millions of parameters it is unfeasible to store such vast matrices. To handle large-scale inverse problems, Zhang & McMechan (1995) modify classic inversion algorithms with least-squares QR factorization. An (2012) evaluates the spatial resolution lengths with a Gaussian approximation to the resolution matrix. Trampert et al. (2012) sample tomographic models for resolution lengths with random probing. Fichtner & van Leeuwen (2015) analyse the direction-dependent resolution lengths of waveform tomography by autocorrelating the randomly sampled Hessian. We refer to Rawlinson et al. (2014) for a recent review of uncertainty estimation.
With the development of matrix probing theories in applied mathematics, randomized singular value decomposition (SVD) has also attracted the attention of geophysicists (Liberty et al. 2007; Halko et al. 2011). Bui-Thanh et al. (2013) formulate waveform tomography in a Bayesian inference workflow, deriving an approximation to the posterior covariance matrix by decomposing the data-misfit Hessian into eigenvalues and eigenvectors with randomized SVD. Zhu et al. (2016) improve the efficiency of the Hessian computation by exploiting point-spread function (PSF) tests. Similar research can also be found in Mosegaard & Tarantola (1995), Gouveia & Scales (1998), Sambridge & Mosegaard (2002) and Osypov et al. (2013). In addition, more recent ensemble-based approaches using Kalman filter (KF) theory (Kalman 1960; Evensen 1994) have been applied to tomography problems by Jordan (2015) and Thurin et al. (2017).
In this second paper on the square-root variable metric (SRVM) approach to FWI, we focus on uncertainty estimation based on a Hessian-related approximation. As mentioned in the first part of our research (Liu et al. 2018), we approximate the inverse Hessian more directly by using the SRVM method (Morf & Kailath 1975; Williamson 1975; Hull & Tapley 1977). Theoretically, the SRVM method is capable of collecting second-order derivative information along the path from the initial model to the final, inverted model. We formulate the SRVM algorithm in a vector version to make it memory-affordable for large-scale problems, and implement this vector-version SRVM within elastic FWI (Liu et al. 2018). The focus of this paper is to retrieve the second-order derivative information from the stored SRVM scalar and vector series in a recursive manner, after convergence of the inversion. The vector-version SRVM algorithm approximates the inverse Hessian in a low-rank form. To make this method computationally more flexible and efficient, we furthermore incorporate randomized SVD into the SRVM approach.
In the following, we first review the theory of Bayesian inference for FWI. Next, we discuss how to reconstruct the inverse Hessian with the vector-version SRVM algorithm and relate the reconstructed matrix to the covariance matrix. Subsequently, we introduce randomized SVD into SRVM. We also discuss a novel approach to obtain the PSFs of the Hessian and attempt to draw random samples from the posterior covariance. Finally, we verify the methods above on numerical examples to demonstrate the information gain of uncertainty analysis for seismic inversions.

I N V E R S E T H E O R Y
For seismic applications, different concepts can be used to explain the inversion of seismic data to determine subsurface structures. In the theory of probabilistic inversions, the solution of an inverse problem is defined as a marginal probability distribution in the model parameter space (Tarantola 2005). This definition is very general and elegant. However, to completely characterize such a probability distribution, we are usually required to evaluate a large number of samples over the parameter space. Therefore, the feasibility of applying probabilistic theory to practical problems strongly depends on the size of the model parameter space.

Review of Bayesian inference
Bayesian inference is a method in which the prior probability is updated as more information becomes available (Tarantola 2005). When applied to FWI workflows, Bayesian inference allows us to incorporate prior information into waveform tomography to estimate posterior uncertainties of the inverted results. To review, we start with a short glance at the forward problem, which predicts the values of the observables d for a given model m via the (usually non-linear) operator g:

d = g(m).  (1)

The forward problem calculates what should be observed for a particular model, whereas the inverse problem calculates what the particular model should be for a set of observations. Eq. (1) denotes the forward-modelling process relating the model parameters and data observables. Herein, it represents the numerical solution of the elastic wave equation based on a spectral-element method (Seriani & Priolo 1994; Faccioli et al. 1996; Komatitsch & Vilotte 1998; Komatitsch & Tromp 1999).
Assuming the model prior is Gaussian distributed (Gauss 1809), we express the prior probability density function (PDF) as

ρ(m) ∝ exp[−(1/2)(m − m_prior)^T C_m^{−1}(m − m_prior)],  (2)

where m_prior is the prior model with its mean being the (initial) model m_0, and C_m is the prior model covariance matrix. ρ(m) can be inferred from geophysical surveys, for example, well logs. The pragmatic choice of Gaussian priors is difficult to avoid because of the central limit theorem. Note, however, that such a Gaussian prior assumption can be invalid for some seismological applications, for example, when parameter discontinuities are involved. For eq. (1), even when the targeted parameter model m happens to coincide with the true parameter model, the synthetic data may still differ from the observations due to data noise and measurement errors. Assuming the inherent discrepancy between the observed and synthetic data to be distributed as Gaussian noise, we represent the likelihood function of the data as

ρ(d|m) ∝ exp[−(1/2)(g(m) − d)^T C_d^{−1}(g(m) − d)],  (3)

where C_d is the data covariance matrix indicating the data uncertainties. According to Bayesian inference, the solution to an inverse problem yields the posterior PDF, defined as

ρ(m|d) ∝ ρ(d|m) ρ(m),  (4)

in which ρ(m|d) indicates the posterior probability, ρ(d|m) the likelihood and ρ(m) the prior probability. Here, ρ(m) is related to the model misfit and ρ(d|m) to the data misfit. Combining eqs (2) and (3) into eq. (4), we have

ρ(m|d) ∝ exp[−f(m)],  (5)

where f(m) is the misfit function in the least-squares sense, namely

f(m) = (1/2)(g(m) − d)^T C_d^{−1}(g(m) − d) + (1/2)(m − m_prior)^T C_m^{−1}(m − m_prior).  (6)

Note that even with the Gaussian assumption for the PDFs of the data error and model prior, the posterior PDF might be non-Gaussian due to the non-linearity of the forward-modelling operator g(m) in the likelihood function of eq. (3). One simplistic approach is to linearize the forward-modelling operator around the maximum a posteriori (MAP) model (Gouveia & Scales 1998; Bui-Thanh et al.
2013), which gives the posterior PDF as

ρ(m|d) ∝ exp[−(1/2)(m − m̃)^T C_M^{−1}(m − m̃)],  (7)

with the MAP model being

m̃ = m_prior + (G^T C_d^{−1} G + C_m^{−1})^{−1} G^T C_d^{−1} [d − g(m_prior)],  (8)

and the posterior covariance being

C_M = (H_d + C_m^{−1})^{−1}.  (9)

In eqs (8) and (9), G = ∂g(m)/∂m is the Fréchet derivative. In eq. (9), H_d = G^T C_d^{−1} G is known as the Gauss-Newton approximation to the data-misfit Hessian (Pratt 1999). The direct calculation of the Fréchet derivative, that is, the sensitivity kernel according to its definition, is infeasible for realistic models. An efficient solution to this problem is the adjoint method (Tarantola 1984, 1987; Tromp et al. 2005), which allows us to obtain the first-order derivative by solving the forward problem once and its adjoint problem once. In this study, in the absence of attenuation, the wave equation is self-adjoint, so there is in principle no need for automatic differentiation techniques (Sambridge et al. 2007; Tan et al. 2010; Vlasenko et al. 2016) to solve the adjoint problem; instead, the same implementation as for the forward solution can be used in reversed time. From eq. (9), we realize that the key to uncertainty estimation is to efficiently quantify the second-order derivative information in H_d.
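For a small linear-Gaussian problem, eq. (9) can be evaluated explicitly. The sketch below (illustrative sizes and operators, not the paper's wave-equation setting) forms the Gauss-Newton data-misfit Hessian H_d = G^T C_d^{-1} G and the posterior covariance C_M = (H_d + C_m^{-1})^{-1}, and checks the basic Bayesian property that the data never increase the prior variances:

```python
import numpy as np

# Toy linear-Gaussian problem: d = G m + noise, with a Gaussian prior on m.
# Sizes and operators are hypothetical, for illustration only.
rng = np.random.default_rng(0)
Mdim, N = 4, 6                     # model and data dimensions
G = rng.standard_normal((N, Mdim)) # linearized forward operator (Frechet derivative)
C_d = 0.1 * np.eye(N)              # data covariance
C_m = 1.0 * np.eye(Mdim)           # prior model covariance

# Gauss-Newton data-misfit Hessian, eq. (9): H_d = G^T C_d^{-1} G
H_d = G.T @ np.linalg.inv(C_d) @ G
# Posterior covariance, eq. (9): C_M = (H_d + C_m^{-1})^{-1}
C_M = np.linalg.inv(H_d + np.linalg.inv(C_m))

# The data reduce uncertainty: posterior variances never exceed prior variances.
assert np.all(np.diag(C_M) <= np.diag(C_m) + 1e-12)
```

Since H_d is positive semi-definite, C_M is smaller than C_m in the Loewner order, which is what the final assertion spot-checks on the diagonals.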

Constructing the inverse Hessian via SRVM vector series
In Newton's method, the model perturbation Δm is related to the gradient g via the Hessian H by

Δm = −H^{−1} g = −B g,  (10)

where B ≡ H^{−1}. The standard Davidon-Fletcher-Powell (DFP) method (Fletcher & Powell 1963) gives an iterative approach to update the inverse Hessian, namely

B_{k+1} = B_k + (Δm_k Δm_k^T)/(Δm_k^T Δg_k) − (B_k Δg_k Δg_k^T B_k)/(Δg_k^T B_k Δg_k),  (11)

where Δm_k and Δg_k are the model and gradient differences at iteration k. A stable and practical algorithmic form modified from the DFP method is the vector-version SRVM, whose detailed workflow can be found in Algorithm (2) of Liu et al. (2018). Considering eq. (11), we see that at convergence of the algorithm the matrix B_{k+1} is a positive-definite approximation that collects information about the inverse Hessian over all n iterations (Tarantola 2005) as follows:

B_{n+1} = B_0 + Σ_{k=0}^{n} [(Δm_k Δm_k^T)/(Δm_k^T Δg_k) − (B_k Δg_k Δg_k^T B_k)/(Δg_k^T B_k Δg_k)],  (12)

where B_0, the initial guess of the inverse Hessian, is usually chosen as the identity matrix I for convenience. Even when a good estimate of the diagonal Hessian is available, the estimated Hessian diagonals are imposed as point-wise weights over the model parameter space (Métivier et al. 2013), in which case the identity matrix I remains. We can furthermore set B_0 = C_m, once a prior covariance matrix C_m is given in diagonal form. B_0 also acts as a stabilizer to keep B_k sufficiently stable during its iterative update. Being a rank-two updating approach, DFP (or SRVM) has a self-correcting behaviour: if B_k incorrectly estimates the curvature of the misfit function, it will tend to correct itself within a few successive steps (Nocedal & Wright 2006). The historical information about the gradients and model updates resides on the right-hand side (RHS) of eq. (12), so we take B_0 = I as the stabilizer and retrieve the approximated inverse Hessian H^{−1} by

H^{−1} ≈ B_{n+1} − I.  (13)

The relation in eq. (13) resembles the Levenberg-Marquardt algorithm (damped least-squares inversion) (Nocedal & Wright 2006). After the SRVM-based elastic FWI converges, we can use the stored scalar series ν_k, P_k and vector series w_k (see Algorithm (2) in Liu et al.
2018) to reconstruct B_{n+1} recursively, as summarized in Algorithm (1). The resulting B_{n+1} is symmetric positive definite, with a corresponding square-root matrix S_{n+1}. In large-scale problems, the full reconstruction and storage of B_{n+1} in memory remain impractical. Fortunately, it is possible to extract the column (row) elements of B_{n+1} via a unit-pulse probing vector (0, 0, ..., 0, 1, 0, ..., 0, 0), in which the unit pulse '1' is located at the target column (row) index i = 1, ..., M. Here, M is the size of the model parameter space. Taking ê_i as the probing vector, we probe B_{n+1} as outlined in Algorithm (2), and fetch B_{n+1} ê_i = b_i when both loops in Algorithm (2) have finished. However, this probing method is not very efficient for extracting arbitrary elements from B_{n+1}: we can only extract one column (row) vector at a time, so extracting all diagonal elements requires M runs of Algorithm (2). Next, we demonstrate how to make the probing of B_{n+1} more convenient.
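The matrix-free idea can be sketched with the plain DFP update of eq. (11): instead of forming B_{n+1}, store one rank-one pair per iteration and apply the operator to any probing vector on demand. This is only a sketch of the principle; the exact scalar/vector series (ν_k, P_k, w_k) of the SRVM recursion in Liu et al. (2018) differ in form, and the toy quadratic below is hypothetical:

```python
import numpy as np

def dfp_apply(S, Y):
    """Return a matrix-free operator v -> B_{n+1} v built from the stored
    model-update columns s_k (in S) and gradient-difference columns y_k (in Y),
    using the DFP rank-two update of eq. (11) with B_0 = I."""
    factors = []                       # (s_k, u_k = B_k y_k, s_k^T y_k, y_k^T u_k)

    def apply(v):
        out = v.copy()                 # B_0 = I contribution
        for s, u, rho, tau in factors:
            out = out + s * (s @ v) / rho - u * (u @ v) / tau
        return out

    for k in range(S.shape[1]):
        s, y = S[:, k], Y[:, k]
        u = apply(y)                   # uses only the first k updates: u = B_k y_k
        factors.append((s, u, s @ y, y @ u))
    return apply

# Toy quadratic misfit with Hessian A, so gradient differences obey y_k = A s_k.
rng = np.random.default_rng(0)
M, n = 8, 5
R = rng.standard_normal((M, M))
A = R @ R.T + M * np.eye(M)            # SPD Hessian
S = rng.standard_normal((M, n))        # model updates
Y = A @ S                              # gradient differences
B = dfp_apply(S, Y)

# Probe one column of B_{n+1} with a unit-pulse vector, as in the text.
e0 = np.zeros(M); e0[0] = 1.0
col0 = B(e0)
# The secant condition B_{n+1} y_n = s_n holds for the most recent pair.
assert np.allclose(B(Y[:, -1]), S[:, -1])
```

Storage is M × n (the columns of S, Y and the auxiliary u_k) rather than M × M, which is the same memory argument made for the SRVM vector series.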

Randomized SVD for the SRVM method
Algorithms (1) and (2), together with eq. (13), present a low-rank approach to construct the inverse Hessian H^{−1}. However, the probing method described above requires further improvement to be convenient for large-scale applications, where the size M of the model parameter space is very large. Randomized SVD offers such an improvement. Given a symmetric M × M matrix Z and a set of M × N_r random vectors X, where M is the model size and N_r the estimated rank order of the matrix Z, the single-pass randomized SVD algorithm (Halko et al. 2011), as shown in Appendix A, processes the matrix Z as summarized in Algorithm (3). For our purpose, we set Z = H^{−1} = B_{n+1} − I. With the reconstruction algorithm in Algorithm (1), it is still impossible to explicitly store H^{−1} for large-scale practical problems. Algorithm (2) works for probing H^{−1}, but with poor efficiency. However, if we incorporate Algorithm (3) into Algorithm (2), then the probing of H^{−1} becomes efficient even for large M, as N_r is usually much smaller than M. We simply need to feed the set of random vectors X from Algorithm (3) into Algorithm (2), looping Algorithm (2) N_r times to obtain Y = B_{n+1} X − X. Here, X comprises the N_r random vectors x_j (j = 1, 2, ..., N_r), and Y is the same size as X. From Algorithm (1), which can be taken as a low-rank form, we know that N_r equals the number of iterations. We finally have the inverse Hessian H^{−1} in an SVD form as

H^{−1} ≈ V Λ V^T,  (14)

in which V is an M × N_r eigenvector matrix and Λ is the eigenvalue matrix with only N_r diagonal entries. Incorporating randomized SVD within the SRVM algorithm provides convenient probing of the inverse Hessian H^{−1}, both for column (row) elements and for diagonal elements.
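A single-pass randomized eigendecomposition of a symmetric matrix, loosely following Halko et al. (2011), can be sketched as below. The function name and the rank choice N_r are illustrative; the only access to Z is through the N_r matrix-vector products Y = Z X, mirroring how Algorithm (2) supplies B_{n+1} X − X:

```python
import numpy as np

def single_pass_evd(matvec, M, Nr, rng):
    """Single-pass randomized eigendecomposition of a symmetric matrix Z that is
    available only through products Y = Z X.  Returns (lam, V) with
    Z ~ V diag(lam) V^T; a sketch, with illustrative names."""
    X = rng.standard_normal((M, Nr))
    Y = np.column_stack([matvec(X[:, j]) for j in range(Nr)])  # the single pass
    Q, _ = np.linalg.qr(Y)                                     # basis for range(Z)
    A, C = Q.T @ X, Q.T @ Y
    # Small Nr x Nr core matrix solves B A = C in least squares, then symmetrize.
    B = np.linalg.lstsq(A.T, C.T, rcond=None)[0].T
    B = 0.5 * (B + B.T)
    lam, U = np.linalg.eigh(B)
    return lam, Q @ U

# Verify on a synthetic symmetric matrix of exact rank 5.
rng = np.random.default_rng(0)
M, r = 60, 5
W = np.linalg.qr(rng.standard_normal((M, r)))[0]
d = np.array([5.0, 3.0, 1.0, 0.5, 0.1])
Z = W @ np.diag(d) @ W.T
lam, V = single_pass_evd(lambda v: Z @ v, M, Nr=r, rng=rng)
assert np.allclose(V @ np.diag(lam) @ V.T, Z, atol=1e-8)
```

When N_r matches the true rank (as the text argues for the SRVM low-rank form, where N_r equals the iteration count), the sketch captures the range of Z and the factorization is essentially exact.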

SVD approach to point-spread function tests
Decomposing the spectrum by randomized SVD makes handling H^{−1} more convenient. Furthermore, after taking the inverse of Λ, we can also access the Hessian:

H ≈ V Λ^{−1} V^T.  (15)

We recall that a single point-spread function (PSF) represents one column (row) of the Hessian. We can extract PSFs from eq. (15) for resolution analysis (Fichtner & Trampert 2011a,b; Bui-Thanh et al. 2013; Fichtner & van Leeuwen 2015). We can also extract diagonals from eq. (15) to show the source illumination, representing the data coverage of the acquisition geometry. Inverse problems such as FWI suffer from being ill-conditioned, mostly due to a lack of data coverage. Therefore, the Hessian might be a singular matrix with a condition number tending towards infinity. Large condition numbers of the inverse Hessian lead to data noise being amplified when mapped into the model space. The truncated SVD used here is thus equivalent to a norm-damping regularization of the inverse problem. The truncation value for Λ could in principle be chosen by an L-curve assessment. For our purpose of probing the Hessian matrix, we prefer to sample H as exactly as possible, and thus choose a truncation value well below a perceivable norm-damping effect.
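Extracting one PSF from eq. (15) then costs only an N_r-term sum. A minimal sketch, with an illustrative truncation threshold standing in for the L-curve choice discussed above:

```python
import numpy as np

def psf_column(V, lam, i, rel_tol=1e-8):
    """One point-spread function: column i of H = V diag(1/lam) V^T, built from
    the eigenpairs (lam, V) of H^{-1}.  Eigenvalues below rel_tol (an
    illustrative threshold) are truncated, acting as norm-damping regularization."""
    keep = np.abs(lam) > rel_tol * np.abs(lam).max()
    Vk, lk = V[:, keep], lam[keep]
    return Vk @ (Vk[i, :] / lk)        # sum_j V[:,j] V[i,j] / lam_j

# Check against a small explicit Hessian.
rng = np.random.default_rng(0)
M = 12
R = rng.standard_normal((M, M))
H = R @ R.T + M * np.eye(M)                # well-conditioned SPD Hessian
lam, V = np.linalg.eigh(np.linalg.inv(H))  # eigenpairs of H^{-1}
assert np.allclose(psf_column(V, lam, 3), H[:, 3])
```

The same eigenpairs yield the Hessian diagonal (source illumination) at the cost of N_r multiplications per model point, instead of one probing run per column.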

Practical assessment of the posterior covariance
Eq. (6) provides an elegant perspective on the objective function, connecting least-squares inversion with Bayesian inference. However, in waveform tomography we do not use eq. (6) directly, but rather a generalized least-squares formulation in a more practical form (Rawlinson et al. 2014; Modrak & Tromp 2016; Modrak et al. 2018),

f(m) = (1/2)(g(m) − d)^T C_d^{−1}(g(m) − d) + (ε/2) δm^T C_m^{−1} δm + (η/2)(Dm)^T(Dm),  (16)

where δm = m − m_0 is the model perturbation, C_d the data covariance matrix, and C_m the prior model covariance matrix, with m_0 being the prior mean model (initial model) for a Gaussian-distributed prior m_prior. D is a second-order derivative smoothing operator (Trinh et al. 2017). ε and η are two user-supplied constants that control the weights of the last two terms relative to the data misfit. Setting ∂f(m)/∂m = 0, the linearized solution of eq. (16) is

m̃ = m_0 + (G^T C_d^{−1} G + ε C_m^{−1} + η D^T D)^{−1} G^T C_d^{−1} [d − g(m_0)],  (17)

in which m̃ is the inverted model (also known as the MAP model) and G the Fréchet derivative. D^T D in eq. (17) works as the smoother, so we rewrite a practical form of the posterior covariance as

C_M = (H_d + ε C_m^{−1})^{−1}.  (18)

We notice that it is difficult to determine the relative magnitude between H_d and C_m^{−1}, although ε acts as a tuning factor. When ε → ∞, C_M → 0; the result is misleading. When the magnitude of ε C_m^{−1} is much higher than that of G^T C_d^{−1} G, C_M → C_m; this is also misleading, in that the model misfit dominates the inversion. When ε → 0, we can take ε C_m^{−1} as a mere stabilizer such that

C_M ≈ (H_d + ε C_m^{−1})^{−1} ≈ H_d^{−1}.  (19)

This last treatment is more reasonable and desirable than the previous ones (Rawlinson et al. 2014). Similarly, the Bayesian transdimensional scheme of Bodin & Sambridge (2009) mainly considers the data misfit function without the damping or smoothing terms. The idea that the Hessian is the inverse posterior covariance matrix in the vicinity of m̃ can also be found in Fichtner & Trampert (2011b) and Fichtner & van Leeuwen (2015).
Therefore, we rewrite eq. (18) as

C_M = C_m^{1/2}(C_m^{1/2} H_d C_m^{1/2} + ε I)^{−1} C_m^{1/2},  (20)

in which we take ε I as the stabilizer, C_m^{1/2} as spatial weights on the Fréchet derivative G, and H^{−1} = (H_d C_m + ε I)^{−1} as a 'big filter' connecting C_M and C_m. With C_m^{1/2} absorbed into G, we approximately have C_M ≈ H^{−1} C_m, where H^{−1} has been approached in eq. (14). The prior covariance C_m, which can be estimated from subsurface orientations or well logs (Fomel & Claerbout 2003), represents the information known before waveform tomography. Herein, we leave the precise estimation of the prior covariance to future research and simply take the diagonal of C_m as constants stored in a vector, for simplicity of implementation. The square-root diagonal of C_M, known as the standard deviation, provides a quantitative measure of the posterior distribution.
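Under this approximation (C_M ≈ H^{-1} C_m with a diagonal C_m), the point-wise posterior standard deviation follows from the low-rank eigenpairs alone, without forming any M × M matrix. A minimal sketch with a hypothetical small Hessian:

```python
import numpy as np

def posterior_std(V, lam, cm_diag):
    """Point-wise posterior standard deviation sqrt(diag(C_M)) from the
    low-rank form H^{-1} = V diag(lam) V^T and a diagonal prior covariance;
    a sketch under the C_M ~ H^{-1} C_m approximation of the text."""
    hinv_diag = np.einsum('ij,j,ij->i', V, lam, V)   # diag(V diag(lam) V^T)
    return np.sqrt(np.clip(hinv_diag * cm_diag, 0.0, None))

# Check against an explicit small example.
rng = np.random.default_rng(0)
M = 10
R = rng.standard_normal((M, M))
Hinv = np.linalg.inv(R @ R.T + M * np.eye(M))
lam, V = np.linalg.eigh(Hinv)
cm = np.full(M, 250.0**2)          # homogeneous prior variance, as used later in the text
std = posterior_std(V, lam, cm)
assert np.allclose(std, np.sqrt(np.diag(Hinv) * cm))
```

The einsum accumulates only the diagonal of V Λ V^T, so the cost is O(M N_r) rather than O(M^2 N_r).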

Random sampling on prior and posterior distributions
After obtaining a low-rank SVD approximation to the posterior covariance C_M, we can draw and compare Gaussian random samples of C_m and C_M. The sampling of the prior and posterior distributions can be expressed, respectively, as

m = m_0 + C_m^{1/2} n,  (21)

m = m̃ + C_M^{1/2} n,  (22)

where m_0 is the prior mean model (initial model) for a Gaussian-distributed prior model m_prior, m̃ the MAP model, and n a 2-D Gaussian random sampler of zero mean and unit variance (Tarantola 2005) in the 2-D case, or a 3-D Gaussian random sampler in the 3-D case. Note that independent 1-D normal random samplers are sufficient for the sampling operations of randomized SVD in Algorithm (3), because the generation of 2-D/3-D Gaussian random samplers is computationally more costly. We compute the square root of the posterior covariance matrix in eq. (22) by

C_M^{1/2} ≈ V Λ^{1/2} C_m^{1/2},  (23)

where V and Λ are the eigenvectors and eigenvalues of H^{−1}, respectively. This way, we can assess the prior and posterior model uncertainties through visual comparisons of random samples drawn from the prior and posterior distributions.
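With a low-rank factor in hand, eq. (22) reduces to one thin matrix-vector product per sample. The sketch below uses synthetic eigenpairs in place of the SRVM-derived ones (the factors V and Λ here are illustrative) and checks that the empirical covariance of many samples matches V Λ V^T:

```python
import numpy as np

rng = np.random.default_rng(1)
M, Nr = 200, 20
# Hypothetical low-rank posterior factors; in the paper's workflow these would
# come from the SRVM series plus randomized SVD.
V, _ = np.linalg.qr(rng.standard_normal((M, Nr)))
lam = np.linspace(1.0, 0.01, Nr)       # eigenvalues of C_M (all positive)
m_map = np.zeros(M)                    # MAP model (zero here for simplicity)

def draw_posterior(n_samples):
    """Draw samples m = m_MAP + C_M^{1/2} z with C_M^{1/2} = V diag(sqrt(lam)),
    where z has only Nr independent standard-normal entries."""
    Z = rng.standard_normal((Nr, n_samples))
    return m_map[:, None] + V @ (np.sqrt(lam)[:, None] * Z)

samples = draw_posterior(50000)
emp_cov = samples @ samples.T / samples.shape[1]
# The empirical covariance approaches C_M = V diag(lam) V^T.
assert abs(np.trace(emp_cov) - lam.sum()) / lam.sum() < 0.05
```

Each sample costs O(M N_r), and only N_r standard-normal draws are needed per sample, consistent with the remark above that independent 1-D samplers suffice.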

N U M E R I C A L E X A M P L E S
The solution of an inverse problem depends on the chosen model parametrization. For elastic media, this could be P- and S-wave velocities and density, or bulk and shear moduli and density, or a parametrization involving bulk wave speed or impedances. In this paper, we focus on the inversion of the P- and S-wave velocities in elastic media, but keep the density fixed due to its low sensitivity in waveform tomography. In the following, we use the 2-D elastic Marmousi model to demonstrate the applications of our methods. For our uncertainty analysis, the standard deviation serves as a measure of uncertainty, and the samplings of the prior and posterior distributions serve as visual inspections. Our demonstration comprises (i) SRVM plus randomized SVD, (ii) point-wise standard deviations of the posterior PDFs, (iii) a visual comparison between the prior and posterior distributions, (iv) 2-D randomly sampled V_P and V_S model perturbations from the posterior PDFs and (v) Hessian diagonals and Hessian PSFs.

2-D elastic Marmousi model
In the 2-D elastic Marmousi benchmark (Martin et al. 2006), the models are 9200 m long and 3000 m deep. The target and initial models are shown in Figs 1 and 2. The acquisition system is placed 10 m below the surface, with 32 shots and 500 receivers distributed evenly. The source function is a Ricker wavelet with a dominant frequency of 4 Hz. The elastic FWI meets the stopping criterion at the 46th iteration, with the normalized data misfit reduced to a value of 0.02. The inverted models are shown in Fig. 2. More detailed comparisons between L-BFGS-based and SRVM-based elastic FWI in different cases are given in Liu et al. (2018). In that elastic Marmousi test, after 46 iterations, we stored 46 SRVM scalars and vectors; we will use these stored scalars and vectors to reconstruct the inverse Hessian and to access the posterior covariance and the Hessian. We begin by evaluating the memory consumption of the SRVM vectors in our method. The full (inverse) Hessian is prohibitive to build because of its vast size: assuming that the model size is M, the full Hessian is of size M × M, whereas the memory required by SRVM is M × N_iter, where N_iter denotes the number of iterations. The memory requirement of the N_iter SRVM scalars is negligible, and usually N_iter is far less than M. Although Algorithm (2) can be used to probe the posterior covariance, it is not very efficient at, for example, extracting the diagonals. To facilitate the matrix probing and operations, we incorporate randomized SVD (Halko et al. 2011) into the SRVM algorithm by combining Algorithm (2) with Algorithm (3). In the randomized SVD, we employ one sampler set composed of 46 independent 1-D random vectors, and then represent the inverse Hessian in an SVD form. Note that the rank order of the matrix factorized by randomized SVD is the same as that of the original matrix.
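The memory argument is simple arithmetic. The sketch below uses a hypothetical grid size for a Marmousi-scale two-parameter (V_P, V_S) model (the exact M depends on the discretization, which the text does not specify) together with the stated 46 iterations:

```python
# Memory for the full Hessian (M x M) versus the SRVM vector series (M x N_iter).
# M here is a hypothetical Marmousi-scale count of grid values, for illustration.
M = 2 * 920 * 300          # two parameters (Vp, Vs) on an assumed 920 x 300 grid
N_iter = 46                # SRVM scalars/vectors stored, as stated in the text

bytes_full = 8 * M * M     # double precision, full matrix
bytes_srvm = 8 * M * N_iter

print(f"full Hessian: {bytes_full / 1e12:.2f} TB")
print(f"SRVM series:  {bytes_srvm / 1e9:.2f} GB")
```

For these assumed sizes the full matrix would occupy terabytes while the vector series fits in a few hundred megabytes, a ratio of exactly M/N_iter.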
The eigenvalues of the inverse Hessian H^{−1} are shown in Fig. 3. In Fig. 4 we display four retrieved eigenvectors associated with different eigenvalues, including the strongest and the weakest ones. Note that the retrieved eigenvalues cover a range of ∼4 orders of magnitude, while the eigenvectors are normalized. We observe from Fig. 4 that as the order of the eigenvector increases, the energy distribution in the eigenvectors moves gradually from the bottom to the top of the subsurface model. We understand this trend from the inverse of H^{−1}, that is, the Hessian, whose energy distribution should fade away from the surface to the bottom in this reflection acquisition setup. The trends in Fig. 4 also imply that it is possible to obtain the approximated Hessian by taking the inverse of H^{−1}, which we will show next.
After obtaining the eigenvalues and eigenvectors, we can extract the diagonal of H^{−1} easily. The extracted square-root diagonals of the inverse Hessian are presented in Fig. 5 as a 2-D 'standard deviation map'. Both the magnitude and the appearance of our map resemble those of fig. 5 of Thurin et al. (2017), who use the ensemble Kalman filter (EnKF) method. This map provides a quantitative way to evaluate the uncertainty of the inversion results. According to the physical interpretation of the Hessian, its energy decreases gradually with depth due to geometrical spreading effects. We can thus infer that the energy of the inverse Hessian should counteract this effect and essentially increase with depth. We also note that certain high-velocity structures have high uncertainties associated with them. This phenomenon can be explained in that a high-velocity structure bends the wave path such that the wave energy is more likely to probe only the onset of the region but fails to sample the full area. In other words, it reflects the experience in FWI that subsurface areas with better data coverage will be better constrained. Note that for some very marginal areas (i.e. the absorbing layers), the uncertainties become zero. This occurs because our method cannot adjust areas where the Fréchet kernels are always zero. Similar explanations are also presented in Kennett et al. (1988) and Rawlinson et al. (2014).
We also compare the standard deviation maps of V_P and V_S and notice that the map of V_P often has larger standard deviations than that of V_S. The difference between them can be explained from the perspective of the radiation patterns of different elastic parameters, as in Wu & Aki (1985) and Tarantola (1986). In their view, the radiation pattern of V_P is isotropic, whereas that of V_S is far-offset dominated. This means that, in the simultaneous elastic inversion, the V_S kernel possesses more low-wavenumber components than the V_P kernel. Low-wavenumber components, however, often have stronger amplitudes (Zhang & Sun 2009) and less chance of falling into a local minimum.
Herein, we leave the estimation of the prior information as an open question and instead set the same standard deviation value of 250 m s^−1 everywhere. With this value and under a Gaussian distribution, the term C_m^{1/2} n in eq. (21) ranges within [−1000, 1000] m s^−1. The 2-D Gaussian random fields for V_P and V_S are shown in Fig. 6. We generate 1000 such random fields independently and use them for the sampling. We visualize the prior distributions of V_P and V_S against depth at three different model positions in Figs 7 and 8, respectively. It is clear in Figs 7 and 8 that the same uncertainty occurs everywhere. We then infer the posterior distribution with eq. (22). The resulting posterior V_P and V_S distributions against depth can be found in Figs 9 and 10. We notice that the visualized distributions are no longer homogeneous, but reflect the different standard deviation values at different depth points. Thus, the heterogeneities indicate the point-wise uncertainties of the posterior model.
For detailed investigation, we pick six points each from the prior and posterior V_P and V_S distributions, as shown in Figs 11 and 12. In these figures, both the prior and posterior distributions have Gaussian shapes, as expected from the sampling and approximations used. Several point distributions nicely indicate a shift between the prior mean and the posterior mean, which is due to the elastic FWI and the choice of initial model. We also notice that the posterior distributions are more 'concentrated', that is, have smaller standard deviations than the prior ones, meaning that the posterior model involves less uncertainty than the prior model. Therefore, the data inversion has led to an information gain and better constrains the subsurface model parameters at these locations.
For a full 2-D plane view of the term C_M^{1/2} n in eqs (22) and (23), we draw four different samples, as shown in Fig. 13. Each sample is obtained by a matrix-vector multiplication with a different n. We note that the posterior model perturbations are spatially continuous. Because the prior model perturbations are computed by C_m^{1/2} n, in which we make a homogeneous diagonal approximation to C_m^{1/2}, the distribution of C_m^{1/2} n should be similar to n and differ only in magnitude. Thus, these samples would be similar to those in Fig. 6, and we do not plot them again. As for C_M^{1/2} n, an interesting aspect is that the posterior samples carry the pattern of the MAP model. The variance of C_M^{1/2} n between different samples helps us to analyse the posterior uncertainties. We also show the posterior samples denoted by eq. (22) in Fig. 14.
To utilize the second-order information by SRVM, we approximate the Hessian with eq. (15). The square-root diagonals of the Hessian are shown in Fig. 15. We take the square roots because according to its original definition, the Hessian contains four-fold geometrical spreading effects. Thus, the square-root form gives a better visualization of the spatial variance of Hessian diagonals. Fig. 15 underscores the physical interpretation: energy strength decreases from top to bottom, including effects of the elastic subsurface structure. Furthermore, this relates to the source illumination and will have an effect on the point-spread function analysis in the following.
We can conveniently extract point-spread functions from eq. (15), where a point-spread function (PSF) represents one row (column) of the Hessian. As examples, two V_P and V_S PSFs are displayed in Figs 16 and 17, respectively, where the 'crosses' mark the row (column) indices. These cross locations can be interpreted as the sites where a localized 'perturbation' of the prior model parameter would occur, and the PSFs then reflect the 'smearing' of the perturbation due to an imperfect resolution operator in the inversion. If the inversion resolution were perfect, the PSFs would retrieve the same point-localized perturbation. We observe that the PSFs usually have orientation-dependent shapes. Also, the plotted PSFs are concentrated locally, close to the original 'perturbation' locations. Still, for such an exploration setup, it would be difficult to assign a confidence region in the form of an ellipsoid around the PSF to estimate resolution lengths in the horizontal/vertical directions. Since in our examples the chosen locations are relatively close to the surface, and thus in areas where the Hessian is dominated by its diagonal elements, the PSFs show little trade-off between the V_P and V_S parameters. By comparing the V_P and V_S PSFs, we notice that the V_S PSFs are more focused and have weaker trade-offs than the V_P ones. This relates to the Hessian, as shown in Fig. 15, where the strengths of the diagonals help to explain why the V_S model is usually better resolved than the V_P model.

D I S C U S S I O N
Results of geophysical inversions have inherent uncertainties due to limitations in measurements and theories. Bayesian inference provides an elegant description of the uncertainties of such geophysical inversions in terms of the posterior PDF in the model space. For elastic FWI, however, a complete characterization of the posterior PDF using statistical sampling methods is prohibitive because of the highly multidimensional model space. One simple and feasible approach is to locate the minimum of the misfit function, which usually corresponds to the MAP model, under the assumption of linearized forward-modeling operators and Gaussian-distributed PDFs. Most gradient-optimization methods focus on finding the MAP model, but without uncertainty estimation.
In our elastic FWI study, and within the presented gradient-optimization framework, we can obtain the MAP model together with the posterior covariance for uncertainty estimation using a quasi-Newton algorithm named SRVM. To allow for large-scale applications, we formulate the SRVM algorithm in a vector version to alleviate the memory-storage burden. The main advantage of the SRVM algorithm is then that, after convergence of the elastic FWI, we can retrieve the information of the posterior covariance from the stored SRVM vector and scalar series. Incorporating the randomized SVD into our SRVM approach facilitates this information retrieval also for large-scale applications. Thus, we can estimate the uncertainty of the MAP model via analysis methods such as point-wise standard deviations, random samplings and point-spread function tests.
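The matrix-free retrieval can be sketched as follows. Assuming the inverse-Hessian square root is built from the identity by a series of rank-one updates with stored vectors v_k and scalars a_k (the exact SRVM recursion is given in Part 1; the factored form below is only illustrative), its action on a vector costs O(M N_r) and never forms an M x M matrix:

```python
import numpy as np

def apply_sqrt(vecs, scalars, x):
    """Apply S = (I + a_1 v_1 v_1^T) ... (I + a_N v_N v_N^T) to x,
    using only the stored vector/scalar series (illustrative form)."""
    y = x.copy()
    for v, a in zip(reversed(vecs), reversed(scalars)):
        y = y + a * v * (v @ y)          # rank-one update, no matrix formed
    return y

def apply_inverse_hessian(vecs, scalars, x):
    """Posterior-covariance action B x = S (S^T x); each symmetric
    rank-one factor equals its own transpose, so S^T reverses the order."""
    y = x.copy()
    for v, a in zip(vecs, scalars):      # apply S^T
        y = y + a * v * (v @ y)
    return apply_sqrt(vecs, scalars, y)  # then apply S

# one-factor sanity check against an explicitly formed matrix
rng = np.random.default_rng(1)
v = rng.standard_normal(6)
a, x = 0.3, rng.standard_normal(6)
F = np.eye(6) + a * np.outer(v, v)
assert np.allclose(apply_sqrt([v], [a], x), F @ x)
```

The same two routines give direct access to posterior samples and PSF tests, since both only require products of the covariance (or its square root) with vectors.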
Our current study, however, is limited to the assumptions of linearized forward modeling operators and Gaussian distributed PDFs. Note that even with simplified Gaussian priors, the resulting posterior PDF may be non-Gaussian and highly complicated in shape due to the presence of model nonlinearities. Also, gradient-optimization methods often find a minimum in the neighborhood of the starting model. In situations such as FWI, where the misfit function has several secondary minima, one needs to either choose multiscale strategies or start close to the global minimum to overcome such shortcomings. We will further explore and address these limitations in future research.

C O N C L U S I O N S
This paper, Part 2 of our study, extends the theories and methods on SRVM-based elastic FWI presented in Part 1. Herein, we estimate the uncertainty of the maximum a posteriori model, where the challenges are high-dimensional model spaces, multiple parameters and computational limitations. We extract the information about the inverse Hessian from the SRVM vector series during elastic FWI. Since the SRVM algorithm is embedded in the elastic FWI, the SRVM vector series inherently captures the information about the multiple parameters. The computational costs are determined by the SRVM-based elastic FWI, which is similarly cost-effective to an L-BFGS-based elastic FWI.
To analyse and manipulate the inverse Hessian efficiently, we factorize it into eigenvalues and eigenvectors using randomized SVD. Based on the randomized SVD results, we can also approximate the Hessian, and thus have elegant access to point-spread functions. Furthermore, we relate the inverse Hessian with the posterior covariance and assess the uncertainty of the posterior model parameters based on point-wise standard deviations, random samplings and point-spread function tests.
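For instance, once the posterior covariance is available in factored form C ~ V diag(lam) V^T (eigenpairs from the randomized SVD), both point-wise standard deviations and random samplings reduce to cheap vector operations. The sketch below is our illustration, not the paper's code, and assumes non-negative eigenvalues:

```python
import numpy as np

def pointwise_std(eigvals, eigvecs):
    """sqrt of diag(C) for C ~ V diag(eigvals) V^T, without forming C."""
    return np.sqrt((eigvecs ** 2) @ eigvals)

def posterior_sample(m_map, eigvals, eigvecs, rng):
    """Draw m = m_MAP + V diag(sqrt(eigvals)) z with z ~ N(0, I)."""
    z = rng.standard_normal(eigvals.size)
    return m_map + eigvecs @ (np.sqrt(eigvals) * z)

# consistency check against an explicitly formed covariance
rng = np.random.default_rng(2)
V, _ = np.linalg.qr(rng.standard_normal((20, 4)))
lam = rng.uniform(0.1, 1.0, 4)
C = V @ np.diag(lam) @ V.T
assert np.allclose(pointwise_std(lam, V), np.sqrt(np.diag(C)))
```

Each posterior sample costs only one matrix-vector product with the (M x k) eigenvector matrix, which is what makes drawing many samples feasible at scale.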

A C K N O W L E D G E M E N T S
The authors are grateful to editor Jean Virieux, reviewer Andreas Fichtner and an anonymous reviewer for their comments, which improved the initial manuscript. The authors are also grateful to Carl Tape for inspiring discussions and valuable input to improve the manuscript. This work was supported by the King Abdullah University of Science & Technology (KAUST) Office of Sponsored Research (OSR) under award No. UAPN#2605-CRG4. Computational resources were provided by the Information Technology Division and the Extreme Computing Research Center (ECRC) at KAUST.

A P P E N D I X A : S I N G L E -PA S S R A N D O M I Z E D S V D
When the input matrix Z is too large, it may be infeasible to revisit it during the randomized SVD. Therefore, Halko et al. (2011) developed the single-pass randomized SVD, which requires just one pass over the input matrix.
In our presented Algorithm (3), the input matrix Z is a symmetric M × M matrix. After convergence of the SRVM-based elastic FWI, we have N r SRVM vectors to approximate the inverse data-misfit Hessian, with N r being the total iteration number. Therefore, we use a set of M × N r random samplers for the factorization of Z, whose rank cannot be greater than N r.
When the QR decomposition in Algorithm (3) is done, we can define a matrix A using the orthogonal matrix Q and the symmetric matrix Z as

A = Q^T Z Q. (A1)

Note that Q has orthonormal columns, so Q^T Q = I and QQ^T is an orthogonal projector onto the range of Q, whence Z ≈ Z(QQ^T). Post-multiplying both sides of eq. (A1) by Q^T X yields

A(Q^T X) = Q^T Z(QQ^T)X ≈ Q^T Z X = Q^T Y, (A2)

in which Z ≈ Z(QQ^T) and Y = ZX. From this, we have

A ≈ (Q^T Y)(Q^T X)^+, (A3)

which is the third line in Algorithm (3). After solving for A, we obtain the low-rank factorization

Z ≈ Q A Q^T, (A4)

which leads to a small eigenvalue (SVD) problem A = UΛU^T and hence Z ≈ VΛV^T with V = QU.
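A minimal NumPy sketch of this single-pass scheme for a symmetric input (function and variable names are ours; Algorithm (3) in the paper is the reference) is:

```python
import numpy as np

def single_pass_randomized_evd(Z, n_samples, seed=0):
    """Single-pass randomized factorization Z ~ V diag(lam) V^T for
    symmetric Z, touching Z only once via the sketch Y = Z X."""
    M = Z.shape[0]
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((M, n_samples))   # random samplers
    Y = Z @ X                                 # the single pass over Z
    Q, _ = np.linalg.qr(Y)                    # orthonormal range basis
    # solve A (Q^T X) ~ Q^T Y in the least-squares sense, i.e.
    # A ~ (Q^T Y)(Q^T X)^+ ; lstsq handles B^T A^T = C^T
    At, *_ = np.linalg.lstsq((Q.T @ X).T, (Q.T @ Y).T, rcond=None)
    A = 0.5 * (At.T + At)                     # symmetrize A = Q^T Z Q
    lam, U = np.linalg.eigh(A)
    return lam, Q @ U                         # V = Q U

# exact-rank check: a rank-3 symmetric matrix is recovered exactly
rng = np.random.default_rng(3)
W = rng.standard_normal((40, 3))
Z = W @ W.T
lam, V = single_pass_randomized_evd(Z, n_samples=8)
assert np.allclose(V @ (lam[:, None] * V.T), Z, atol=1e-8)
```

When the true rank of Z is below the number of samplers, as in the SRVM setting where the rank cannot exceed N r, the recovery is exact up to round-off; otherwise it is a near-optimal low-rank approximation.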

A P P E N D I X B : C L A S S I C A L R E S O L U T I O N M O D E L T E S T C A S E : 2 -D E L A S T I C C H E C K E R B OA R D M O D E L
In the 2-D elastic Checkerboard test, the models are 9000 m long and 3000 m deep. The true and initial models are shown in Figs B1 and B2. The acquisition system is located 10 m below the surface, with 32 shots and 500 receivers distributed evenly. The source function is a Ricker wavelet with a dominant frequency of 4 Hz. The elastic FWI converges at the 100th iteration, with the data misfit reduced to a value of 0.001. The inverted models are shown in Fig. B2. A detailed comparison between L-BFGS- and SRVM-based elastic FWIs for different cases can be found in Part 1 of our study (Liu et al. 2018). We handle the 2-D elastic Checkerboard model similarly to the 2-D Marmousi model. One notable difference between the two models is their PDFs: the Checkerboard PDFs are more spread out, which may be caused by more reflected waves existing in the Checkerboard test. For better illustration, we show all related tests in Figs B1-B16.
Figure B1. True V P and V S models used for data generation in the elastic Checkerboard test.
Figure B2. Initial and inverted V P and V S models in the elastic Checkerboard test. Top panel: prior V P and V S models obtained by smoothing the true models in Fig. B1. Bottom panel: inverted V P and V S models by SRVM-based elastic FWI after 100 iterations (details about data and model misfits can be found in Part 1 of our study).
Figure B6. Examples of 2-D Gaussian random fields with zero mean and unit variance, used to sample the prior and posterior distributions. We draw 1000 such samplers, each independent of the others. The 2-D random fields are used only for sampling the distributions; for the sampling in the randomized SVD, 1-D random vectors suffice, and their generation is more cost-effective.
Figure B14. Square-root values of the Hessian diagonals in the elastic Checkerboard test. We plot the square-root form because the strict calculation of the data-misfit Hessian involves three cross-correlations between four Green's functions. In isotropic elastic media, each Green's function includes the effect of geometrical spreading, so the accumulated effect of fourfold geometrical spreading would become considerable.
Figure B15. Two V P PSFs from the estimated Hessian. The 'delta-like perturbations' are located in the V P models, as indicated by the two 'crosses', respectively.
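The two kinds of random samplers described for the Checkerboard test (independent 2-D Gaussian fields for sampling the distributions, 1-D Gaussian vectors for the randomized SVD) can be generated as in the sketch below; the grid sizes and sampler counts are illustrative, not the paper's actual discretization:

```python
import numpy as np

rng = np.random.default_rng(4)
nz, nx = 60, 180                 # illustrative 2-D model grid
n_fields = 1000

# 2-D Gaussian random fields, zero mean and unit variance,
# each sample independent of the others
fields = rng.standard_normal((n_fields, nz, nx))

# 1-D random samplers suffice (and are cheaper) for randomized SVD
samplers_1d = rng.standard_normal((nz * nx, 20))

# empirical check of zero mean / unit variance
assert abs(fields.mean()) < 1e-2
assert abs(fields.var() - 1.0) < 1e-2
```

Both samplers come from the same standard-normal generator; the only difference is the shape in which the draws are arranged and consumed.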