Estimating the model parameters from measured data generally consists of minimizing an error functional. A classic technique to solve a minimization problem is to successively determine the minimum of a series of linearized problems. This formulation requires the Fréchet derivatives (the Jacobian matrix), which can be expensive to compute. If the minimization is viewed as a non-linear optimization problem, only the gradient of the error functional is needed. This gradient can be computed without the Fréchet derivatives. In the 1970s, the adjoint-state method was developed to efficiently compute the gradient. It is now a well-known method in the numerical community for computing the gradient of a functional with respect to the model parameters when this functional depends on those model parameters through state variables, which are solutions of the forward problem. However, this method is less well understood in the geophysical community. The goal of this paper is to review the adjoint-state method. The idea is to define some adjoint-state variables that are solutions of a linear system. The adjoint-state variables are independent of the model parameter perturbations and in a way gather the perturbations with respect to the state variables. The adjoint-state method is efficient because only one extra linear system needs to be solved.
Several applications are presented. When applied to the computation of the derivatives of the ray trajectories, the link with the propagator of the perturbed ray equation is established.
One of the important tasks in data processing is determining model parameters from observed data. Such tasks can be formulated as inverse problems, namely as the minimization of a functional, for instance the least-squares misfit between synthetic and observed data. In geophysics, this includes tomography, migration/inversion and automatic velocity analysis. When a local (descent) optimization technique, such as the conjugate gradient method, is used, the gradient of the functional is required, Gauthier et al. (1986), Liu & Bleistein (2001) and Mora (1989). The efficiency of the method greatly depends on the accuracy and efficiency of the computation of this gradient. With physical problems, the functional depends on so-called state variables. These state variables are computed from the state equations, namely the equations that define the problem, sometimes called forward equations. For example, in the tomography problem, the state equations can be the ray equations, and the state variables the spatial coordinates and the slowness vectors describing the ray trajectories. The definition of the state variables depends on the mathematical formulation of the physical problem. The state equations depend on the model parameters. For the tomography problem, this can be the background velocity (or slowness). The functional depends on those model parameters mainly through its dependency on the state variables.
The gradient of the functional, which depends on a set of state variables, solutions of the forward equations, can be obtained with (a set of) the Fréchet derivatives of the state variables. The Fréchet derivatives are the derivatives of the state variables with respect to the model parameters. For instance, for the tomography problem described earlier, these derivatives are the derivatives of the spatial coordinates and the slowness vectors with respect to the slowness background. This gives the so-called Jacobian or sensitivity matrix. The Jacobian matrix can be used to linearize the functional, and the minimization problem can be solved by successively solving linearized problems using linear optimization techniques. However, the computation of the Fréchet derivatives can be expensive.
If non-linear optimization techniques, such as the non-linear conjugate gradient method, are used, only the gradient of the functional may be needed, Gill et al. (1981). In the 1970s, a method based on the adjoint state was introduced into the theory of inverse problems by Chavent (1974) to efficiently compute the gradient of a functional without the Fréchet derivatives. This approach originated from control theory, Lions (1972). Several authors in geophysics have applied this method, for instance, Lailly (1983), Bécache (1992), Chavent & Jacewitz (1995), Plessix et al. (1999) and Shen et al. (2003). The goal of this note is to review this method, which is well known in the numerical community, to give a recipe for applying it based on an augmented functional, also called the associated Lagrangian, and to describe several examples demonstrating its practical use.
The adjoint-state method is a general method to compute the gradient of a functional that depends on a set of state variables, which are solutions of forward equations. The adjoint-state variables are the solutions of an adjoint linear system and can be seen as variables that gather a global measure of the perturbation of the problem with respect to the state variables. Numerically, this approach is attractive because only one extra linear system needs to be solved, and often the computation of the gradient with respect to the model parameters is equivalent to one or two evaluations of the forward modelling. This cost is often almost independent of the number of model parameters, which is not always the case when the Fréchet derivatives are computed. However, the adjoint-state method does not provide the sensitivity of the solution to errors. For that, the Fréchet derivatives are needed, or a Monte Carlo type of method with a large number of forward computations is required.
The outline of the paper is the following. In the first section, the adjoint-state variables are introduced from perturbation theory. Then, based on an augmented functional, a recipe to systematically define the adjoint-state equations is described. In the next three sections, examples are given. The first one is least-squares migration, Tarantola (1987). This is almost a textbook example. The functional is the least-squares misfit between the synthetics and the measured reflection seismic data. The adjoint states correspond to the backpropagated field, and the gradient of this least-squares misfit is a migration operator, Lailly (1983) and Tarantola (1984). The second example is the computation of the gradient of the differential semblance optimization (DSO) functional, Symes & Carazzone (1991) and Shen et al. (2003). This example shows the power of the adjoint-state technique. The third example is stereotomography, Billette & Lambaré (1998) and Lambaré et al. (2004). The link between the adjoint-state variables and the propagator of the differential equation defining the ray trajectory perturbations, Cervený (2001) and Farra & Madariaga (1987), is established.
The goal of this section is to explain the adjoint-state method for computing the gradient of a functional, J(m), when J depends on u(m). J is defined with the functional, h, by

J(m) = h(u(m)),

where u(m), the physical realization of the model m, satisfies the state equations

F(u(m), m) = 0.
A simple example is the linear case, with h(u) = 1/2 〈u − d, u − d〉U and d the observed data. This corresponds to the simple least-squares misfit with a linear forward problem. The physical realization is defined by u(m) = Am, with A a linear operator, in the discrete case a rectangular matrix.
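As a quick numerical illustration of this linear case (the matrix sizes and data below are arbitrary choices for the sketch, not from the paper), the gradient of the least-squares misfit is A*(Am − d), which can be checked against a finite-difference approximation:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))   # linear forward operator (rectangular matrix)
d = rng.standard_normal(5)        # "observed" data
m = rng.standard_normal(3)        # model parameters

def J(m):
    """Least-squares misfit J(m) = 1/2 <Am - d, Am - d>."""
    r = A @ m - d
    return 0.5 * r @ r

# Gradient via the adjoint of A (no Frechet derivatives of J needed).
grad = A.T @ (A @ m - d)

# Finite-difference check, one perturbation per model parameter.
eps = 1e-6
fd = np.array([(J(m + eps * e) - J(m - eps * e)) / (2 * eps) for e in np.eye(3)])
assert np.allclose(grad, fd, atol=1e-5)
```

The finite-difference check is essentially exact here because J is quadratic in m.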
The adjoint-state method from the perturbation theory
A perturbation, δm, of the model parameter, m, induces a perturbation, δu, of the physical realization, u, and a perturbation, δJ, of the error functional, J. u + δu should be a physical realization with the model parameter m + δm. Therefore, to the first order:

δJ = 〈∂h/∂u, δu〉U.
For the simple least-squares misfit, δJ = 〈u – d, δu〉U.
Assuming that for any model parameter m of M there exists a unique solution u of U, u + δu is the unique solution of F(u + δu, m + δm) = 0. Therefore, at the first order, δu is the unique solution of eq. (4),

(∂F/∂u) δu = −(∂F/∂m) δm,   (4)

and can be written with the inverse of ∂F/∂u:

δu = −(∂F/∂u)⁻¹ (∂F/∂m) δm.   (5)

This gives:

δJ = −〈∂h/∂u, (∂F/∂u)⁻¹ (∂F/∂m) δm〉U
   = −〈[(∂F/∂u)*]⁻¹ ∂h/∂u, (∂F/∂m) δm〉U.   (6)
In the second line of eq. (6), the terms that do not depend on the perturbation δm have been gathered; the idea is to avoid the computation of the Fréchet derivatives, ∂u/∂m, because this can be expensive. Let us now define λ by

(∂F/∂u)* λ = ∂h/∂u.   (7)
λ belongs to the dual space of U. It is called the adjoint-state variable and eq. (7) is the adjoint-state equation. This is a system of linear equations. The linear operator is the adjoint of the operator formed by the derivatives of the state equations (the mapping F) with respect to the state variables. The right-hand side consists of the derivatives of the functional, h, with respect to the state variables. In a sense, the adjoint states gather the information on the perturbations of the state variables, viewed as independent variables. The gradient then reads

∂J/∂m = −〈λ, ∂F/∂m〉.   (8)

The computation of ∂J/∂m with eqs (7) and (8) is called the adjoint-state method.
The gradient of J can be computed either via the Fréchet derivatives of u with eqs (4) and (5) or via the adjoint-state method with eqs (7) and (8). The important differences between the two approaches are related to eqs (4) and (7). Indeed, on one hand, eq. (7) is independent of δm and needs to be solved only once. On the other hand, the right-hand side of eq. (4) depends on δm, and this equation needs to be solved for each perturbation to obtain ∂u/∂m, namely M times if M is the number of elements in m. The computational time of the adjoint-state method is often almost independent of M, because the time to compute eq. (8) is often negligible compared with the time to solve eq. (7). This makes this approach very efficient. Eq. (7) depends on the adjoint of ∂F/∂u evaluated at (u, m); this means that u should be completely known before solving it.
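The cost difference between the two routes can be made concrete on a toy state equation F(u, m) = Bu − m = 0 (B, the sizes and the data are illustrative assumptions for this sketch): the Fréchet route solves one linear system per model parameter, whereas the adjoint route solves a single system with the adjoint operator, and both yield the same gradient.

```python
import numpy as np

rng = np.random.default_rng(1)
M = 4                                             # number of model parameters
B = rng.standard_normal((M, M)) + 5 * np.eye(M)   # well-conditioned state operator
d = rng.standard_normal(M)                        # "observed" data
m = rng.standard_normal(M)

u = np.linalg.solve(B, m)            # state equation F(u, m) = B u - m = 0

# Route 1: Frechet derivatives -- one linear solve per model parameter
# (here all M solves at once, one per column of the identity).
du_dm = np.linalg.solve(B, np.eye(M))
grad_frechet = du_dm.T @ (u - d)     # chain rule with dh/du = u - d

# Route 2: adjoint state -- a single solve with the adjoint operator B^T.
lam = np.linalg.solve(B.T, u - d)
grad_adjoint = lam                   # -<lam, dF/dm> with dF/dm = -Identity

assert np.allclose(grad_frechet, grad_adjoint)
```

The adjoint route reaches the same vector with one solve instead of M, which is the scaling argument made in the text.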
The adjoint-state equations can also be obtained with the use of an augmented functional, also called associated Lagrangian.
A recipe with the augmented functional

The adjoint-state equations can be obtained systematically from the augmented functional, defined with ũ and λ̃ viewed as independent variables by

𝓛(ũ, λ̃, m) = h(ũ) − 〈λ̃, F(ũ, m)〉U.

Setting to zero the derivative of 𝓛 with respect to ũ, evaluated at the physical realization u, gives eq. (7), the adjoint-state equation. With this choice, the derivative of 𝓛 with respect to m, evaluated at (u, λ), retrieves the result of eq. (8).
𝓛 can also be viewed as the Lagrangian associated with the minimization problem: find the minimum u of h(ũ) under the constraint F(ũ, m) = 0. The theory of optimization with equality constraints, Ciarlet (1989), tells us that u is the minimum if (u, λ) is a saddle point of 𝓛. The λ are called the Lagrange multipliers. At the saddle point, the derivatives of 𝓛 are equal to 0. The derivatives of 𝓛 with respect to ũ and λ̃ are:

∂𝓛/∂ũ = ∂h/∂ũ − (∂F/∂ũ)* λ̃,
∂𝓛/∂λ̃ = −F(ũ, m).
Therefore, ∂𝓛/∂λ̃ = 0 gives the state equations, and ∂𝓛/∂ũ = 0 gives the adjoint-state equations. The derivative of 𝓛 with respect to m gives the gradient of J, as seen previously. Notice again that when differentiating 𝓛 with respect to m, ũ and λ̃ are independent of m.
This link with the optimization theory is not needed to apply the adjoint-state method. For those familiar with this theory, it helps to recall the method. As in the optimization theory with equality constraints where one scalar Lagrange multiplier is associated with each scalar equation defining the constraints, one scalar adjoint state is associated with each scalar equation defining the mapping F in the augmented functional.
The computation of the gradient with the adjoint states can be summarized in the following recipe when u has been found from F(u, m) = 0: form the augmented functional by associating one adjoint-state variable with each scalar state equation; set its derivatives with respect to the state variables to zero to obtain the adjoint-state equations; solve these equations for λ; and evaluate the gradient as the derivative of the augmented functional with respect to m.
For the linear case with the least-squares misfit, the recipe gives λ = Am − d and the gradient of J is A*(Am − d).
The linear example with the least-squares misfit is a trivial example. In the next sections more complicated examples are described.
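Before turning to the geophysical examples, the recipe can be exercised on a small toy problem where the state equation is (diag(m) + K)u = f (the operators, data and sizes here are illustrative assumptions, not from the paper); the adjoint solve and the assembled gradient −λk uk agree with finite differences:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
K = 0.3 * rng.standard_normal((n, n))   # fixed coupling operator
f = rng.standard_normal(n)              # source term
d = rng.standard_normal(n)              # "observed" data
m = 2.0 + rng.random(n)                 # model parameters (keeps diag(m) + K invertible)

def solve_state(m):
    """State equation F(u, m) = (diag(m) + K) u - f = 0."""
    return np.linalg.solve(np.diag(m) + K, f)

def J(m):
    u = solve_state(m)
    return 0.5 * np.sum((u - d) ** 2)

# Recipe: solve the state equation, solve the adjoint-state equation
# (dF/du)^T lam = dh/du, then assemble dJ/dm_k = -<lam, dF/dm_k> = -lam_k u_k.
u = solve_state(m)
lam = np.linalg.solve((np.diag(m) + K).T, u - d)
grad = -lam * u

eps = 1e-6
fd = np.array([(J(m + eps * e) - J(m - eps * e)) / (2 * eps) for e in np.eye(n)])
assert np.allclose(grad, fd, atol=1e-5)
```

Here the state operator depends on m, so the gradient couples the adjoint state with the state variable itself, unlike the trivial linear case.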
In this section, we formulate migration as an inverse problem. The problem consists of minimizing, with respect to the square of the slowness, the least-squares misfit between the synthetics, obtained by solving the wave equation, and the recorded (observed) reflection seismic data. The minimization of J should give the exact slowness. Unfortunately, in practice, J has many local minima, and a gradient optimization will only provide the best perturbation of the initial model inside a certain basin of attraction. This basin is generally not the basin of the global minimum, Gauthier et al. (1986). This is the reason why this problem is called the least-squares migration problem in this paper.
This example is a good one for understanding how the adjoint-state method can be applied. It also allows us to show again that the gradient of J is a migration. This was discovered in the 1980s, Lailly (1983) and Tarantola (1984).
We will develop the computation for multiple sources and multiple receivers, first in the frequency domain because it is simple, then in the time domain.
In the frequency domain, the wave equation operator reads L = −ω²σ² − Δ, with σ the slowness. Note that the dependency on the spatial coordinates, x, is not written. The finite-difference discretization of Lus = fs with given boundary conditions leads to a complex linear system, Marfurt (1984),

A(m) us = fs,   (21)

which contains n scalar equations.
The matrix A propagates the shot into the earth and us is the incident field originating at s. The adjoint of A propagates its source term backward, Lailly (1983). The source term, the right-hand side of eq. (25), is the sum over the receivers of the shot s of the residual between the synthetics and the data. λs is then the backpropagation of the residual field, Claerbout (1985). A demonstration of this result without the adjoint-state method can be found in Plessix & Mulder (2004).
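A minimal sketch of this frequency-domain computation, under simplifying assumptions (1-D Laplacian with Dirichlet boundaries, a small complex damping term standing in for absorbing boundaries, synthetic "data"; none of these choices come from the paper): the adjoint state λs is obtained by solving the conjugate-transpose system with the receiver residuals as source, and the resulting gradient matches finite differences.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
omega = 3.0
# 1-D Laplacian with Dirichlet boundaries, unit grid spacing.
L = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
m = 1.0 + 0.1 * rng.random(n)                  # squared slowness sigma^2
f = np.zeros(n, dtype=complex)
f[2] = 1.0                                     # point source near one end
R = np.eye(n)[-5:]                             # five "receivers" at the far end
d = rng.standard_normal(5) + 1j * rng.standard_normal(5)   # synthetic "data"

def A(m):
    # Helmholtz-like operator -omega^2 m - Laplacian; the small complex
    # damping keeps the matrix invertible (a stand-in for absorbing boundaries).
    return -omega**2 * (1 + 0.05j) * np.diag(m) + L

def J(m):
    u = np.linalg.solve(A(m), f)
    r = R @ u - d
    return 0.5 * np.real(np.vdot(r, r))

u = np.linalg.solve(A(m), f)                   # incident field
lam = np.linalg.solve(A(m).conj().T, R.conj().T @ (R @ u - d))  # backpropagated residual
# dA/dm_k = -omega^2 (1 + 0.05j) e_k e_k^T, hence:
grad = np.real(omega**2 * (1 + 0.05j) * np.conj(lam) * u)

eps = 1e-6
fd = np.array([(J(m + eps * e) - J(m - eps * e)) / (2 * eps) for e in np.eye(n)])
assert np.max(np.abs(grad - fd)) < 1e-4 * (1 + np.max(np.abs(fd)))
```

The gradient is the pointwise correlation of the incident field with the backpropagated residual, which is the imaging (migration) structure described in the text.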
We now develop the same approach in the time domain. The application of the adjoint-state method is slightly more complicated because of the initial conditions in time.
In the time domain, us is real. For simplicity, we do not mention the spatial boundary conditions. From eqs (30) and (31), we can now compute the derivatives with respect to the state variables, us, and evaluate them at (u, λ) to obtain the adjoint-state equations:
The system (32) has final conditions in time. To solve it, the computation is done backward from T to 0. To give a physical sense to the adjoint state and to interpret the integral, eq. (33), a new adjoint state, qs, is defined by a change of variables in the time axis: qs(t) = λs(T − t). Eq. (35) propagates the residual into the earth starting from the final time. qs is called the backpropagated field of the residual. The gradient of J now reads:
This result has been demonstrated by Lailly (1983).
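The change of variables qs(t) = λs(T − t) can be illustrated on a scalar toy oscillator (the coefficients and source below are illustrative, not the wave equation of the paper): stepping the final-value adjoint problem backward in time is identical to stepping an initial-value problem forward with the time-reversed source.

```python
import numpy as np

a = 4.0                                    # stiffness of a toy oscillator
N = 200
dt = 0.01
s = np.sin(np.linspace(0.0, 3.0, N + 1))   # "residual" source term

# Adjoint-like problem: d^2(lam)/dt^2 = -a*lam + s(t) with final
# conditions lam(T) = dlam/dt(T) = 0, stepped backward with leapfrog.
lam = np.zeros(N + 1)
for n in range(N - 1, 0, -1):
    lam[n - 1] = 2 * lam[n] - lam[n + 1] + dt**2 * (-a * lam[n] + s[n])

# Change of variables q(t) = lam(T - t): a standard initial-value
# problem driven by the time-reversed source s(T - t).
q = np.zeros(N + 1)
for n in range(1, N):
    q[n + 1] = 2 * q[n] - q[n - 1] + dt**2 * (-a * q[n] + s[N - n])

assert np.allclose(q, lam[::-1])
```

Because the leapfrog recurrence is time-symmetric, the discrete identity q[k] = lam[N − k] holds exactly, mirroring the continuous change of variables.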
Shot-based Differential Semblance Optimization
As explained in the introduction of the previous section, the least-squares formulation is not satisfactory for retrieving the long-wavelength components of the velocity model (background) from reflection seismic data, because the least-squares misfit as a function of the background has many local minima. To reformulate the problem and obtain a larger basin of attraction for the global minimum, an idea is to exploit the fact that in the reflection seismic data the earth is seen through different angles of incidence. If the background velocity is correct, the pre-stack migration of the data should give the same images, Al Yahya (1989). If the pre-stack migration gives different earth structures, this means that the background slowness used in the migration is erroneous. Several mathematical formulations of this idea have been proposed in the last 20 years, among them Chavent & Jacewitz (1995), Clément (2001), Plessix et al. (2000) and Symes & Carazzone (1991). To compute the gradient of the reformulated cost functions, the authors generally use the adjoint-state formulation because it is the most systematic method, without forgetting that they are mainly mathematicians.
As an example, I will describe the gradient computation of the DSO functional introduced in Symes & Carazzone (1991) for a common shot-based approach. The principle is to migrate each shot individually and then to differentiate the pre-stack migrated result with respect to the shot position for fixed points in the migrated images. If the derivative with respect to the shot position of the pre-stack migrated data is zero, the migrated images are independent of the shot position, that is, of the angle of incidence, and the background is correct. To obtain a global formulation, the DSO functional is used as a regularization of the least-squares functional.
Using a finite-difference scheme in the frequency domain, eq. (21), the incident wavefield, ui, due to the source function fi located at the shot position i, satisfies eq. (37). The backpropagated wavefield, vi, and the migrated image, ri, are defined by eqs (38) and (39); eq. (39) means that ri is defined at each discretization point x.
We recall that m is the squared slowness at the discretization points and that all the state variables are differentiable functions of m. The goal is to compute the gradient of J with respect to the squared slowness.
The forward equations, eqs (37), (38) and (39), depend on the state variables ui, vi and ri. ui, vi and ri are discretized on the same grid; therefore, they belong to Cn, with n the number of discretization points. To define the augmented functional, we associate, for each shot i, the adjoint-state variable λui with eq. (37), λvi with eq. (38) and λri with eq. (39). λui, λvi and λri belong to Cn because eqs (37), (38) and (39) each define n scalar equations. These quantities depend on ω and x. The augmented functional reads, with the state variables, (ui, vi, ri), and the adjoint-state variables, (λui, λvi, λri):
This application shows the numerical interest of the adjoint-state method with the augmented functional. Indeed, a systematic use of the method automatically produces the result. Using the perturbation approach as described in the first section can lead to the same result, but its application is a bit more difficult and cumbersome, as shown in the Appendix. The physical interpretation of the adjoint states is difficult to find. We can notice that λr is just the perturbation with respect to r of the DSO functional, λu satisfies the adjoint wave equation, and λv the wave equation.
The last example describes an application based on the ray equations. The functional depends on the traveltimes and on the other ray-based parameters. The derivatives of the traveltimes with respect to the velocity are efficiently computed by integrating the slowness perturbations along the ray. There is no real gain in introducing the adjoint states when only the derivatives of the traveltimes with respect to the velocity parameters are required, because the adjoint-state method does not give a more efficient algorithm. The derivatives of the ray trajectories can be evaluated from the paraxial ray equations and the propagator associated with this linear differential system, Cervený (2001) and Farra & Madariaga (1987). When the functional depends on the ray trajectories, the adjoint-state method provides a faster approach to the computation of the gradient of the functional. This case is illustrated with the stereotomography functional.
The purpose of stereotomography, as described in Billette & Lambaré (1998), is to retrieve the velocity background not only from traveltimes picked on the seismic data but also from slopes of the locally coherent events in the common source and common receiver gathers. This approach differs from classic traveltime tomography because the picks are interpreted independently of each other, without any association to a given interface, and they can represent either reflection or refraction events. The (observed) data are a set of source positions, xs, receiver positions, xr, two-way traveltimes, Tmsr, the slopes, ps, at the source locations, and the slopes, pr, at the receiver locations.
Following Billette & Lambaré (1998), the model parameters are xd the subsurface reflection points, θs and θr the take-off angles of the rays going towards the source, xs, and towards the receiver, xr, Ts and Tr the traveltimes along the rays from xd towards the source and the receiver, and the parameters (vk) defining the continuous velocity field, v. The integration parameter along the ray is the time, t. The ray equations are the classic kinematic ray-tracing equations with time parametrization,

dx/dt = v² p,   dp/dt = −∇v/v.

In eq. (46), cT is a scalar coefficient and Cs and Cr are two diagonal matrices.
Gradient with the adjoint-state method
Gradient with the Fréchet derivatives
A more traditional approach to compute the gradient is to first determine the Fréchet derivatives. In this case, this means the derivatives of Ts, Tr, ys(Ts) and yr(Tr) with respect to Ts, Tr, xd, θs, θr and (vk). A usual approach to compute those derivatives is to use the paraxial ray equations, Cervený (2001) and Farra & Madariaga (1987). The perturbation δya of the rays ya is obtained from the propagator Pa defined by eq. (50), Billette & Lambaré (1998).
Relation between adjoint states and propagator

The adjoint-state system is given by eq. (48). The adjoint state, λa, can then be expressed with the transpose of the propagator (Ca is a diagonal matrix). Let us compare the adjoint-state system (eq. 48) and the propagator system (eq. 50). Whereas the first one is a vectorial system, the second is a matrix system. This means that the adjoint-state system is d times smaller than the propagator system, with d = 4 in 2-D problems and d = 6 in 3-D problems. The adjoint-state method is then roughly d times faster. However, the adjoint-state method does not provide the Jacobian matrix, but only the gradient of J, and the Jacobian may be used to determine the sensitivity of the solution to errors. The minimization should then rely on non-linear optimization techniques, such as non-linear conjugate gradient or quasi-Newton methods, Gill et al. (1981). In Billette & Lambaré (1998), the authors solve the non-linear optimization problem by successively solving the linear problems defined by the Jacobian matrices.
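The relation between the adjoint system and the propagator can be checked on a constant-coefficient linear system dy/dt = By, a stand-in for the paraxial ray system (B, the dimension and T below are illustrative assumptions): the propagator of the adjoint equation dλ/dt = −Bᵀλ, run backward in time, is exactly the transpose of P(T) = exp(BT), and the bilinear form 〈λ(t), y(t)〉 is conserved along the trajectories.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(4)
d_dim = 4                       # e.g. d = 4 for a 2-D paraxial system
B = rng.standard_normal((d_dim, d_dim))
T = 0.7

P = expm(B * T)                 # propagator of dy/dt = B y over [0, T]

# Adjoint system d(lam)/dt = -B^T lam, integrated backward from t = T:
# lam(0) = expm(B^T T) lam(T) = P^T lam(T).
lamT = rng.standard_normal(d_dim)
lam0 = expm(B.T * T) @ lamT
assert np.allclose(lam0, P.T @ lamT)

# The bilinear form <lam(t), y(t)> is invariant along the trajectories,
# which is the duality the adjoint-state method exploits.
y0 = rng.standard_normal(d_dim)
assert np.isclose(lam0 @ y0, lamT @ (P @ y0))
```

Computing P requires propagating d basis vectors, while the adjoint solve propagates a single vector: this is the d-fold saving quantified in the text.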
The adjoint-state method for the gradient computation of a functional has been reviewed. The technique applies when the functional depends on the model parameters through a set of state variables, solutions of forward equations. The method consists of the computation of one unique extra linear system. The linear operator is formed with the adjoint of the operator defined by the derivatives of the forward model with respect to the state variables, and the right-hand side consists of the derivatives of the functional with respect to the state variables. The adjoint-state variables are the solution of this linear system. In a sense, they gather the information of the perturbations with respect to the state variables, assuming that the state variables are independent variables. Since this linear system is independent of the derivatives with respect to the model parameters, the adjoint states have to be computed only once, making the method numerically very efficient. The gradient of the functional with respect to the model parameters is then simply the scalar product between the adjoint states and the derivatives of the forward model with respect to the model parameters.
To form this extra linear system (the adjoint-state system) a recipe based on an augmented functional has been reviewed. This provides a systematic approach. The main step in the definition of this augmented functional is to consider the state variables and the adjoint-state variables as independent variables.
Several examples have been described to show the power of this approach. For the complicated example with the DSO functional, the adjoint-state equations have been derived from perturbation theory. This shows that the use of the augmented functional is not strictly necessary but simplifies the approach.
When the forward equations are the ray equations, the transpose of the propagator of the perturbed ray equations is the propagator of the adjoint equations. The adjoint-state method is computationally more efficient because the adjoint-state system is a vectorial system whereas the system of the propagator is a matrix system. Nevertheless, the adjoint-state method only gives the gradient, and not the Fréchet derivatives. The optimization problem should be solved with a non-linear optimization method, such as a quasi-Newton or non-linear conjugate gradient technique.
I would like to thank Gilles Lambaré from Ecole des Mines de Paris and Colin Perkins from Shell for their comments on the manuscript.
Appendix: DSO Gradient with the Perturbation Approach
In this appendix, we retrieve the gradient of the DSO functional directly from the perturbation approach, without the help of the augmented functional. If the model, m, is perturbed by δm, the wavefields, ui, are perturbed by δui, the backpropagated wavefields, vi, are perturbed by δvi, the migrated images, ri, are perturbed by δri, and the functional J by δJ. From eq. (40) we obtain
The complicated part consists of defining the adjoint-state equations. However, with some experience, this is possible. For this example, we first rewrite the second term of eq. (A1).
By replacing δvi by its value in the second term of eq. (A6), we obtain eq. (A9). Eqs (A4) and (A9) are the adjoint-state equations. They are the same as the ones obtained with the use of the augmented functional in the section on the DSO functional; however, the derivation is less obvious and systematic in this appendix.