Maximum-likelihood estimation of lithospheric flexural rigidity, initial-loading fraction, and load correlation, under isotropy

Topography and gravity are geophysical fields whose joint statistical structure derives from interface-loading processes modulated by the underlying mechanics of isostatic and flexural compensation in the shallow lithosphere. Under this dual statistical-mechanistic viewpoint an estimation problem can be formulated where the knowns are topography and gravity and the principal unknown the elastic flexural rigidity of the lithosphere. In the guise of an equivalent"effective elastic thickness", this important, geographically varying, structural parameter has been the subject of many interpretative studies, but precisely how well it is known or how best it can be found from the data, abundant nonetheless, has remained contentious and unresolved throughout the last few decades of dedicated study. The popular methods whereby admittance or coherence, both spectral measures of the relation between gravity and topography, are inverted for the flexural rigidity, have revealed themselves to have insufficient power to independently constrain both it and the additional unknown initial-loading fraction and load-correlation fac- tors, respectively. Solving this extremely ill-posed inversion problem leads to non-uniqueness and is further complicated by practical considerations such as the choice of regularizing data tapers to render the analysis sufficiently selective both in the spatial and spectral domains. Here, we rewrite the problem in a form amenable to maximum-likelihood estimation theory, which we show yields unbiased, minimum-variance estimates of flexural rigidity, initial-loading frac- tion and load correlation, each of those separably resolved with little a posteriori correlation between their estimates. We are also able to separately characterize the isotropic spectral shape of the initial loading processes.

equivalence of the results when either method was applied in a "consistent" formulation, taking into account the finite window size of any patch of available data. Still, large differences remained, experiments on synthetic data showed significant bias and large variance, and a clear consensus failed to arise. Macario et al. (1995), McKenzie (2003) and  investigated the effect of the statistical correlation between surface and subsurface loads. For their part, Diament (1985), Lowry & Smith (1994), Simons et al. (2000Simons et al. ( , 2003, Ojeda & Whitman (2002), Kirby & Swain (2004, 2008a and Audet & Mareschal (2007) focused on the spectral estimation of admittance and coherence via maximum-entropy, multitaper and wavelet-based methods, and identified the spectral bias, leakage and variance inherent in those. Much as the controversy involved the geological consequences of a thick versus a thin lithosphere, with only gravity and topography as the primary observations and no significant divergence in viewing the physics of the problem, that is, of elastic flexure in a multilayered system, over time the arguments evolved into a debate that was mostly about spectral analysis. Least-squares fitting of admittance and coherence functions, however determined, had become synonymous with the process of elastic-thickness determination.
The appropriateness of using least squares is not something that can be taken for granted but rather needs to be carefully assessed, as was pointed out early on in this context by Dorman & Lewis (1972), Banks et al. (1977), Stephenson & Beaumont (1980) and Ribe (1982), which, however, also focused on other issues that have since received more attention. Admittance and coherence are "statistics": functions of the data with non-Gaussian distributions even if the data themselves are Gaussian (Munk & Cartwright 1966;Carter et al. 1973;Walden 1990; Thomson & Chave 1991;Touzi & Lopes 1996;Touzi et al. 1999). Estimators for flexural rigidity based on any given method have their own distributions, though not necessarily ones with a tractable form. Without knowledge of the joint properties of admittance-and coherence-based estimators it is impossible to assess the relative merits of any method for a given data set or true parameter regime; with current state-of-the-art understanding it is not even clear if the two methods are statistically inconsistent.
At this juncture this paper aims for a return to the basics, by asking the question: "What information does the relation between gravity and topography contain about the (isotropic) strength of the elastic lithosphere?" and by formulating an answer that returns the full statistical distribution of the estimates derived from such data. As such, it should provide a framework for the interpretation of the early work on which we build: as others before us we are merely using the measurable ingredients of gravity, topography and the flexure equations. However, as we shall see, we do not need to consider this a two-step process by which first the transfer function needs to be estimated non-parametrically and then the inversion for structural parameters performed with the estimated transfer function as "data". This approach amounts to a loss of most of the degrees of freedom in the data, replacing them with spectral ratios estimated at a much smaller set of wavenumbers, and with much of the important information on the flexural rigidity compromised due to lack of resolution at the low wavenumbers. Rather, we can treat it as an optimization problem that uses everything we know about gravity and topography available as data to directly construct a maximum-likelihood solution for the lithospheric parameters of interest. These are returned together with comprehensive knowledge of their uncertainties and dependencies, and with a statistical apparatus to evaluate how well they explain the data; the analysis of the residuals then informing us where the modeling assumptions were likely violated. By the principle of functional invariance the maximum-likelihood solution for elastic thickness and loading ratio also returns the maximum-likelihood estimates of the coherence and admittance themselves, which can then be compared to those obtained by other methods. Admittance may be superior to coherence, or vice versa, in particular scenarios, but only maximum-likelihood, by definition, produces solutions that are preferred globally for all parameter regimes (Pawitan 2001;Severini 2001;Young & Smith 2005). Finally, we note that understanding the likelihood is also a key component of fully Bayesian solution approaches (e.g. Mosegaard & Tarantola 1995;Kaipio & Somersalo 2005).

B A S I C F R A M E W O R K
Despite their singular focus on deriving density profiles to reconstruct the portion of the Bouguer gravity field that is linearly related to the topography and thereby "explain" the isostatic compensation of surface topography to first order, even when the strength of the lithosphere had to be effectively prescribed, Dorman and Lewis' Experimental Isostasy 1, 2 and 3 contained virtually all of the elements of the analysis of gravity and topography by which the problem could be turned around to the, in the words of Lambeck (1988) "vexing", question "What is the flexural strength of the lithosphere"? The elements applicable to the analysis were the expressions for admittance and coherence between topography and the Bouguer, free-air, and isostatic residual gravity anomalies, the averaging or smoothing required to statistically stabilize the estimate of the transfer function that is the intermediary between the data and the model obtained by inversion for the unknown parameters (if not the density distribution, then the mechanical properties of the plates), the notion of correlated and uncorrelated noise of various descriptions: indeed all of the ingredients that will form the vernacular of our present contribution. In this section we redefine all primary quantities of interest in a manner suitable for the statistical development of the problem.
We treat Earth locally as a Cartesian system. Our chosen coordinate system has x = (x1, x2) in the horizontal plane and definesẑ pointing up: depths in Earth are negative. A density contrast located at interface j is found at depth zj ≤ 0, and is denoted ∆j = ρj − ρj−1. (1) Two layers is the minimum required to capture the full complexity of the general problem which may, of course, contain any number of layers. In a simple two-layer system, the first interface, at z1 = 0, is the surface of the solid Earth, and ρ0 is the density of the air (or water) overlying it. The density of the crust is ρ1, and the second interface, at z2 ≤ 0, separates the crust from the mantle with density ρ2. For now we use the term "topography" very generally to describe any departure from flatness at any surface or subsurface interface. By "gravity" we mean the "anomaly" or "disturbance"; both are differences in gravitational acceleration with respect to a certain reference model. These departures in elevation and acceleration are all small: we consider topography to be a small height perturbation of a constantdepth interface, and neglect higher-order finite-amplitude effects on the gravity. We always assume that the "loads", the stresses exerted by the topography, occur at the density interfaces and not anywhere else. If not in the space domain, x, we will work almost exclusively in the The initial loads were generated from the Matérn spectral class with parameters σ 2 , ρ and ν; they were not correlated, r = 0, and the spectral proportionality was f 2 . Also shown, by the black line, is the Bouguer gravity anomaly, G •2 . The density contrasts used were ∆ 1 = 2670 kg m −3 and ∆ 2 = 630 kg m −3 , respectively. All symbols and processes are clarified in the text. They will furthermore be identified and briefly explained in Table 1. Fourier domain, using the wave vector k or wavenumber (spatial frequency) k = k . We only distinguish between both domains when we need to, and then only by their argument. All of this corresponds to standard practice (Watts 2001).
Looking ahead we draw the readers' attention to Fig. 1, which contains a graphical representation of the problem. Fig. 1 is, in fact, the result of a data simulation with realistic input parameters. Many of the details of its construction remain to be introduced and many of the symbols remain to be clarified. What is important here is that we seek to build an understanding of how, from the observations of gravity and topography, we can invert for the flexural rigidity of the lithosphere in this two-layer case. The observables (rightmost single panel) are the sum of the flexural responses (middle panels) of two initial interface-loading processes (leftmost panels) which have occurred in unknown proportions and with unknown correlations between them.

Spatial and spectral representation, theory and observation
Writing H and G without argument we will be referring quite generically to the random processes "topography" and "gravity" respectively, though when we consider either physical quantity explicitly in the spatial or spectral domain we will distinguish them accordingly as H(x) or G(x) (in space), and dH(k) or dG(k) (in spectral space), where they depend on spatial position x or on wave vector k, respectively. In doing so we use to the Cramér (1942) spectral representation under which dH(k) and dG(k) are well-defined orthogonal increment processes (Brillinger 1975;Percival & Walden 1993), in the sense that at any point in space we may write H(x) = e ik·x dH(k) and G(x) = e ik·x dG(k).
We make the assumption of stationarity such that for every point x under consideration all equations of the type (3) are statistically equivalent. We further assume that both processes will be either strictly bandlimited or else decaying very fast with increasing wavenumber k = k such that we may restrict all integrations over spectral space to the Nyquist plane k ∈ [−π, π]×[−π, π]. While this is certainly a geologically reasonable assumption we would at any rate be without recourse in the face of the broadband bias and aliasing that would arise unavoidably if it were violated. For simplicity x maps out a rectangle that can be sampled on an M × N ≈ 2K grid given by In the non-rarified world of geophysical data analysis we will not be dealing with stochastic processes directly, rather with particular realizations thereof. These are our gravity and topography data, observed on finite domains, to which we continue to refer as H(x) and G(x). The modified Fourier transform of these measurements, obtained after sampling and windowing with a certain function wK (x), is In this expression WK(k) is the unmodified Fourier transform of the energy-normalized applied window, The spectral density or spectral covariance of continuous stationary processes is defined as the ensemble average (denoted by angular brackets) whereby we denote complex conjugation with an asterisk and δ(k, k ′ ) is the Dirac delta function. There can be no covariance between non-equal wavenumbers if the spatial covariance matrix is to be dependent on spatial separation and not location, as from eqs (3) and (7) H In contrast to eq. (7), as follows readily from eqs (5) and (7), the covariance between the modified Fourier coefficients of the finite sample is Eqs (5) and (9) show that the theoretical fields dH(k) and their spectral densities SHH(k) are out of reach of observation from spatially finite sample sets. Spectrally we are always observing a version of the "truth" that is "blurred" by the observation window. Even if, or rather, especially when the windowing is implicit and only consists of transforming a certain rectangle of data, this effect will be felt. For example, whereas the true spectral density is obtained by Fourier transformation of the covariance at all lags, denoted by the summed infinite series a blurred spectral density is what we obtain after observing only a finite set, denoted by the summed finite series with |FK | 2 denoting Fejér's kernel (Percival & Walden 1993). The design of suitable windowing functions (in this geophysical context, see, e.g., Simons et al. 2000Simons et al. , 2003Simons & Wang 2011), is driven by the desire to mold what we can calculate from the observations into estimators of these "truths" that are as "good" as possible, e.g. in the minimum mean-squared error sense; we will keep the windows or tapers wK (x) and the convolution kernels WK(k) generically in all of the formulation. For the gravity observable, whose spectral density is denoted SGG, we find the modified Fourier coefficients and the spectral covariance, respectively, as Finally, we will need to sample H(k), WK(k), and G(k) on a grid of wavenumbers. Exploiting the Hermitian symmetry that applies in the case of real-valued physical quantities, for an M × N data set we select the half-plane consisting of the K = M × (⌊N/2⌋ + 1) wave vectors The quantities H(k), WK(k), and G(k) are complex except at the dc wave vectors (0, 0) and the Nyquist wave vectors (0, π), (−π, 0) and (−π, π) if they exist in eq. (14), which depends on the parity of M and N .

Topography
As mentioned before, we apply the term "topography", H, generically to any small perturbation of the Cartesian reference surface, which is assumed to be flat. Specifically, we need to distinguish between what we shall call 'initial', 'equilibrium' and 'final' topographies, respectively. In the classic multilayer loading scenario reviewed by, e.g., McKenzie (2003) and Simons et al. (2003), as the jth interface gets loaded by an initial topography, the singly-indexed quantity Hj, a configuration results in which each of the interfaces expresses this loading by assuming an equilibrium topography, which is identified as the double-indexed quantity Hij. The first subscript refers to the interface on which the initial loading occurs; the second to the interface that reflects this process. The state of this equilibrium is governed by the laws of elasticity, as we will see in the next section. All of these equilibrium configurations combine into what we shall call the final topography on the jth interface, namely H•j, where the • is meant to evoke the summation over all of the interfaces that have generated initial-loading contributions.
Thus, in a two-layer scenario, what in common parlance is called "the" topography, i.e. the final, observable height of mountains and the depth of valleys expressed with respect to a certain neutral reference level, will be called H•1, and this then will be the sum of the two unobservable components H11 and H21. In other words, the final "surface" topography is Likewise, the final "subsurface" topography, H•2, is given by the sum of two unobservable components H12 and H22, totaling This last quantity, H•2, is not directly observable but can be calculated from the Bouguer gravity anomaly, as we describe below. Both H11 and H12 refer to the same geological loading process occurring on the first interface but being expressed on the first and second interfaces, respectively. In a similar way, H21 and H22 refer to the process loading the second interface which thereby produces topography on the first and second interfaces, respectively. While postponing the discussion on the mechanics to the next section it is perhaps intuitive that a positive height perturbation at one interface creates a negative deflection at another interface: "mountains" have "roots", as has been known since the days of Airy (Watts 2001). The initial-loading topography, then, is given by the difference between these two equilibrium components. At the first and second interfaces, respectively, we will have for the initial topographies at the surface and subsurface, respectively, The sum of all of the equilibrium topographies, at all of the interfaces in this system and thus requiring two subscripts •, is given by which is a quantity that we can only access through the free-air gravity anomaly that it generates, as we shall see.

Flexure
Mechanical equilibrium exists between H11 and H12 on the one hand, and H21 and H22 on the other. The equilibrium refers to the balance between hydrostatic driving and restoring stresses, which depend on the density contrasts, and the stresses resulting from the elastic strength of the lithosphere. Introducing the flexural rigidity D, in units of Nm, we obtain the biharmonic flexural or plate equation (Banks et al. 1977;Turcotte & Schubert 1982) as follows on the first (surface) interface: and at the second (subsurface) level, we have The mechanical constant D is the objective of our study: geologically, this yields to what is commonly referred to as the "integrated strength" of the lithosphere, which can be usefully interpreted under certain assumptions as an equivalent or "effective" elastic thickness. This quantity, Te, in units of m, relates to D by a simple scaling involving the Young's modulus E and Poisson's ratio, ν, as is well known (e.g. Ranalli 1995;Watts 2001;Kennett & Bunge 2008). Here we follow these authors and simply define Much has been written about what Te really "means" in a geological context (Lowry & Smith 1994;Burov & Diament 1995;Lowry & Smith 1995;McKenzie & Fairhead 1997;Burov & Watts 2006). This discussion remains outside of the scope of this study. Moreover,eqs (20) are the only governing equations that we shall consider in this problem. It is not exact (e.g. McKenzie & Bowin 1976;Ribe 1982), it is not complete (e.g. Turcotte & Schubert 1982), and it may not even be right (e.g. Karner 1982;Stephenson & Lambeck 1985;McKenzie 2010). For that matter, a single, isotropic D may be an oversimplification (Stephenson & Beaumont 1980;Lowry & Smith 1995;Simons et al. 2000Simons et al. , 2003Audet & Mareschal 2004;Swain & Kirby 2003b;Kirby & Swain 2006). However, the neglect of higher-order terms, additional tectonic terms in the force balance, time-dependent visco-elastic effects and elastic anisotropy remain amply justified on geological grounds. It should be clear, however, that any consideration of such additional complexity will amount to a change in the governing equations (20), which we reserve for further study.
At the surface, eq. (20a) is solved in the Fourier domain as where we have defined the dimensionless wavenumber-dependent transfer function baptized by Forsyth (1985) ξ At the subsurface, eq. (20b) has the solution with the dimensionless filter function All of the physics of the problem is contained in the equations in this section. As a final note we draw attention to the assumption that the interfaces at which topography is generated and those on which the resulting deformation is expressed coincide: this is the first of the important simplifications introduced by Forsyth (1985). This assumption, though not universally made (e.g. McNutt 1983;Banks et al. 2001), is broadly held to be valid. Finding D in this context is the estimation problem with which we shall concern ourselves.

Gravity
Every perturbation from flatness by topography generates a corresponding effect on the gravitational acceleration when compared to the reference state. We relate the gravity anomaly to the disturbing topography by the density perturbation ∆j and account for the exponential decay of the gravity field from the depth zj ≤ 0 where it was generated. The "free-air" gravitational anomaly (Hofmann-Wellenhof & Moritz 2006) from the topographic perturbation at the jth interface that results from the ith loading process is given in the spectral domain by where G is the universal gravitational constant, in m 3 kg −1 s −2 , not to be confused with the gravity anomaly itself. Once again this equation is inexact in assuming local Cartesian geometry (Turcotte & Schubert 1982;McKenzie 2003) and neglecting higher-order finite-amplitude effects (Parker 1972;Wieczorek & Phillips 1998), but for our purposes, this "infinite-slab approximation" will be good enough. The observable free-air anomaly is the sum of all contributions of the kind (26), thus in the two-layer case The Bouguer gravity anomaly is derived from the free-air anomaly by assuming a non-laterally varying density contrast across the surface interface. It thus removes the gravitational effect from the observable surface topography (Blakely 1995), and is given by In this reduction, we have used eqs (26)- (27), (16) and (22). For simplicity we shall write the Bouguer anomaly as defining one more function, which acts like a harmonic "upward continuation" operator (Blakely 1995), At this point we remark that topography and gravity, in one form or another, are the only measurable geophysical quantities to help us constrain the value of D. The Bouguer anomaly G•2 is usually computed from the free-air anomaly G•• and the topography H•1, assuming a density contrast ∆1. Any estimation problem that deals with any combination of these variables should thus yield results that are equivalent to within the error in the estimate (Tarantola 2005), though whether the free-air or the Bouguer gravity anomaly is used as the primary quantity in the estimation process could have an effect on the properties of the solution depending on the manner by which it is found -a paradox that this paper will eliminate.

Observables, deconvolution, and loading
We are now in a position to return to writing explicit forms for the theoretical observables from whose particular realizations (the data), ultimately, we desire to estimate the flexural rigidity D. These are the final "surface" topography, given by combining eqs (15) and (24) as By analogy we shall write for the final "subsurface" topography that which we can obtain by "downward continuation" (Blakely 1995) of the Bouguer gravity anomaly. From eq. (32), or combining eqs (16) and (22) this quantity is then The dependence on the parameter of interest, the flexural rigidity D, is non-linear through the "lithospheric filters" φ and ξ. While both H•1 and H•2 can thus be "observed" (or at least calculated from observations) we are for the moment taciturn about the complexity caused by the potentially unstable inversion of the parameter χ (see also Kirby & Swain 2011). We return to this issue in Section 5. Combining eqs (17)-(18) with eqs (22)-(24) and then substituting the results in eqs (34)-(35) yields the equations that relate the observed topographies on either interface with the applied loads. Without changing from the expressions first derived by Forsyth (1985) these have come to be called the "load-deconvolution" equations ( with the inverse relationships given by It should be noted that when D = 0, in the absence of any lithospheric flexural strength, thus in the case of complete Airy isostasy, φξ = 1 at all wavenumbers, and no such solutions exist. In that case the problem of reconstructing the initial loads has become completely degenerate. Armed with these solutions we can solve for the equilibrium loads. Combining eqs (17)-(18) with eqs (22)-(24) returns usable forms for H11 and H22, and substituting the results back into eqs (22)-(24) returns H12 and H21, all in terms of the initial loads H1 and H2, as To complete this section we formulate the initial-loading stresses, in kgm −1 s −2 , at each interface as All variables that we have introduced up to this point are listed in Table 1, to which we further refer for units and short descriptions. We are now also in the position of further interpreting Fig. 1, once again drawing the readers' attention to the heart of the problem, which is the estimation of the single parameter, the flexural rigidity D, which is responsible for generating, from the initial loads (left), the equilibrium topographies (middle) whose summed effects (right) we observe in the form of "the" topography and the (Bouguer) gravity anomaly.

Admittance and coherence
Modeled after eq. (7), the Fourier-domain relation between the theoretical observable quantities that are the surface topography H•1 and the Bouguer gravity anomaly G•2 is encapsulated by the complex-valued theoretical Bouguer admittance, which we define as A quantity whose expression eliminates the dependence on the location of the first interface contained in the term χ of eq. (32) is the real-valued Bouguer coherence-squared, the Cauchy-Schwarz bounded quantity As illustrated by eqs (9)-(13), similarly, the values of either ratio when calculated using actual observations H•1 and G•2 or H•2, with or without explicit windowing, will be estimators for eqs (41) and (42), but will never manage to recover more than a blurred version of the true cross-power spectral density ratios that they are, and with an estimation variance that will depend on how the required averaging is implemented (Thomson 1982;Percival & Walden 1993). Despite the various attempts by many authors (Diament 1985;Lowry & Smith 1994;Simons et al. 2000Simons et al. , 2003Kirby & Swain 2004Audet & Mareschal 2007;Simons & Wang 2011) to design optimal data treatment, wavelet or (multi-)windowing procedures, with the common goal to minimize the combined effect of such bias or leakage and estimation variance, in the end this may result in a well-defined (non-parametric) estimate for coherence and admittance, but the actual quantity of interest, the flexural rigidity, D, still has to be determined from that. As we wrote in the Introduction, understanding the statistics of the estimators for D derived from estimates of coherence or admittance depends on fully characterizing their distributional properties: a daunting task that, to our knowledge, has never been successfully attempted. Without this, however, we will never know which method is to be preferred under which circumstance. Moreover, we will never be able to properly characterize the standard errors of the estimates except by exhaustive trial and error (see, e.g., Pérez-Gussinyé et al. 2004;Crosby 2007;Kalnins & Watts 2009) from data that are synthetically generated. This is no trivial task (Macario et al. 1995;Ojeda & Whitman 2002;Kirby & Swain 2008a,b, 2009; we return to this issue later. We have hereby reached the essence of this paper: our goal is to estimate flexural rigidity D from observed topography H•1 and gravity G•2; estimates based on inversions of estimated admittance and coherence have led to widely different results, a general lack of understanding of their statistics, and thus a failure to be able to judge their interpretation. We must thus abandon doing this via the intermediary of admittance Q• and coherence γ 2 • , and rather focus on directly constructing the best possible estimator for D from the data. This realization is not unlike that made in the last decade by the seismological community, where the inversion of (group velocity? phase velocity?) surfacewave dispersion curves or individual-phase travel-time measurements has made way for "full-waveform inversion" in its many guises (e.g. Tromp et al. 2005;Tape et al. 2007). There too, the model is called to explain the data that are actually being collected by the instrument, and not via an additional layer of measurement whose statistics must remain incompletely understood, or modeled with too great a precision. In cosmology, the power-spectral density of the cosmic microwave background radiation (Dahlen & Simons 2008) is but a step towards the determination of the cosmological parameters of interest (e.g. Jungman et al. 1996;Knox 1995;Oh et al. 1999

T H E S T A N D A R D M O D E L
The essential elements of a geophysical and statistical nature as they had been broadly understood by the late 1970s were reintroduced in the previous section in a consistent framework. In this section we discuss the important innovations and simplifications brought to the problem by Forsyth (1985). In a nutshell, in his seminal paper, Forsyth (1985) made a series of model assumptions that resulted in palatable expressions for the admittance and the coherence as defined in eqs (41) and (42), neither of which would otherwise be of much utility in actually "solving" the problem of flexural rigidity estimation from gravity and topography. The first two of these were already contained in eq. (20): loading and compensation occur discretely at one and the same set of interfaces, and the constant describing the mechanical behavior of the system is a scalar parameter that does not depend on wavenumber nor direction. The first assumption might be open for debate, and indeed alternatives have been considered in the literature (e.g. Banks et al. 1977Banks et al. , 2001, but reconsidering it would not fundamentally alter the nature of the problem. The second: isotropy of the lithosphere, which is certainly only a null hypothesis (see, e.g. Stephenson & Beaumont 1980;Bechtel 1989;Simons et al. 2000Simons et al. , 2003Swain & Kirby 2003b;Kirby & Swain 2006, and many observational studies that work on the premise that it must indeed be rejected), does require a treatment that is to be revisited but presently falls outside the scope of this work. To facilitate the subsequent treatment we restate the equations of Section 2.5 in matrix form.

Flexure of an isotropic lithosphere, revisited
We shall consider the primary stochastic variables to be the initial-loading topographies H1 and H2, respectively, and describe their joint properties, and their relation to the theoretical observable final topographies H•1 and H•2 by defining the spectral increment process vectors Subsequently, we express the process by which lithospheric flexure maps one into the other in the shorthand notation where the real-valued entries of the non-symmetric lithospheric matrices MD(k) and M −1 D (k) can be read off eqs (36)-(37) and the functional dependence on the scalar constant flexural rigidity D is implied by the subscript. We now define the (cross-)spectral densities between the individual entries in the initial-topography vector dH(k) as in eq. (7) by writing and form the spectral matrix S(k) from these elements using the Hermitian transpose as Lithospheric flexure transforms the spectral matrix of the initial topographies, S(k), to that of the final topographies, S•(k), defined as via the mapping implied by eqs (44) through (46). We specify We can now see that the theoretical admittance and coherence of eqs (41)-(42) can equivalently be written as which explains why so many authors before us have focused on admittance and coherence calculations as a spectral estimation problem.
To be valid spectral matrices of real-valued bivariate fields, the complex-valued S(k) and S•(k) only need to possess Hermitian symmetry, that is, invariance under the conjugate transpose, and be positive-definite, that is, have non-negative real eigenvalues. The spectral variances of the initial and final topographies at the individual interfaces, S11(k) ≥ 0 and S22(k) ≥ 0, both arbitrarily depend on k, but without dependence between k = k ′ . The only additional requirements are that S12(k) = S * 21 (k) and |S12(k)| 2 ≤ S11(k)S22(k). The general form of S(k) as a stationary random process can be rewritten with the aid of a coherency or spectral correlation coefficient, r(k), which expresses the relation between the components of surface and subsurface initial topography as This correlation coefficient is in general complex-valued as the two fields may be spatially slipped versions of one another. The representation is simply a most complete description of a bivariate random spectral process (Christakos 1992). Should we make the additional assumption of joint isotropy for all of the loads, the spectral matrices would both be real and symmetric, S(k) = S(k) and S•(k) = S•(k). In keeping with the notation from eq. (8), we would require a spatial covariance matrix to only depend on distance, not direction. With θ the angle between k and x − x ′ we would have the real-valued with J0 the real-valued zeroth-order Bessel function of the first kind. With S real, the spectral variances and covariances between top and bottom loading components would all be real-valued and so would the correlation coefficient r(k) = r(k). It is important to note that the isotropy of the fields individually does not imply their joint isotropy. Two such fields can be spatially slipped versions of one another, but with slippage in a particular direction the fields may remain marginally isotropic but their joint structure will not.

Correlation between the initial loads
Statistically, eqs (45) and (49) imply that the initial-loading topographies on the two interfaces are related spectrally as whereby H ⊥ 1 (x), the zero-mean orthogonal complement to H1(x), is uncorrelated with it at all lags. The interpretation of what should cause a possible "correlation" between the initial-loading topographies must be geological (McGovern et al. 2002;McKenzie 2003;Belleguic et al. 2005;Wieczorek 2007;). Erosion (e.g. Stephenson 1984;Aharonson et al. 2001) is typically amenable to the description articulated by eq. (52), though much work remains to be done in this area to make it apply to the most general of settings. Under isotropy of the loading, the implication is that the initial subsurface loading H2(x) can be generated from the initial surface loading H1(x) by a radially symmetric convolution operator p(x), By selecting the initial loads Hj as the primary variables of the flexural estimation problem, and not the equilibrium Hij or final loads H•j , we now have the correlation r between the initial loads to consider in the subsequent treatment. Geologically, this puts us in a bit of a quandary, since if eq. (52) holds, this can only mean that one loading process "follows the other in time", "reacting to it". However, the temporal dimension has not entered our discussion at all, and if it did, it would certainly make sense to choose the correlation between the equilibrium load on one and the initial load on the other interface as the one that matters. The linear relationship (44) between the loads renders these two viewpoints mathematically equivalent. Our definition of eq. (49) is chosen to be mathematically convenient because it is most in line with the choices to be made in the next section. Forsyth (1985) deemed correlations between surface and subsurface loads to be potentially important but he did not make the determination of the correlation coefficient (49) part of the estimation procedure for the flexural rigidity D, which was instead predicated on the assumption, his third by our count, that r(k) = 0. He did recommend computing the correlation coefficient between the initial loads via eq. (44), after the inversion for D, and using the results to aid with the interpretation (see, e.g., Zuber et al. 1989). Studies by Macario et al. (1995), Crosby (2007), Wieczorek (2007) and  have since shed more light on how to do this more quantitatively, but to our knowledge no-one has actually attempted to determine the best-fitting correlation coefficient as part of an inversion for flexural rigidity. Forsyth (1985) introduced the 'loading fraction' as the subsurface-to-surface ratio of the power spectral densities of the initial-loading stresses I2 and I1, and thus from eqs (39)-(40) and (45) we can write

Proportionality between the initial loads
This definition is fairly consistently applied in the literature (e.g. Banks et al. 2001), though McKenzie (2003 has preferred to parameterize by the fraction each of the loads contributes to the total, which is handy for situations with multiple interfaces (see ) and subsurface-only loading. Eq. (54) is a statement of proportionality of the power spectral densities of the initial loads, S2 and S1. With this constraint, which we identify as his fourth assumption, Forsyth (1985) was able to factor S11 out of the spectral matrix S in eq. (45), which as we recall from the previous section, by his third assumption had no off-diagonal terms, to arrive at simplified expressions for S• of eq. (46), which acquires off-diagonal terms through eq. (47), and ultimately for the admittance Q• and coherence γ 2 • in eq. (48). We revisit these quantities in the next section but conclude with the general form of the initial-loading spectral matrix that is implied by the definition of proportionality, which is With what we have obtained so far: flexural isotropy of the lithosphere, MD(k), correlation of the initial-loading processes, r(k), and proportionality of the initial-loading processes, f 2 (k), the spectral matrix (47) of the final topographies -those we measure -is given by where we have defined the auxiliary matrices We define both T and ∆T so that we can easily revert to a model of zero correlation, in which case ∆T = 0. Note that we are silent about the dependence on wavenumber by using the shorthand notation ξ and φ for the lithospheric filters (23) and (25), but have kept the full forms of the correlation coefficient r(k) and the loading ratio f 2 (k) to stress that they are in general functions of the wave vector as defined by eqs (49) and (54). In general r will be complex and of magnitude smaller than or equal to unity, and f 2 (and f ) will be real and positive.

Admittance and coherence for proportional and correlated initial loads
Via eqs (56)-(58) we have explicit access to the (cross-)spectral densities between the individual elements in the final-topography vector dH•, as required to evaluate eq. (46). We shall now consider those for the special case where both r(k) = r and f 2 (k) = f 2 are constants, no longer varying with the wave vector. Then, following eq. (48), we obtain simple expressions for the admittance and coherence that we shall further specialize to a few end-member cases for comparison with those treated in the prior literature. We hereby complete Table 1 to which we again refer for a summary of the relevant notation. The Bouguer-topography admittance, for correlated and proportional initial loads with constant correlation r and proportion f 2 , is Spectrally, this is a function of wavenumber, k, only, since the power spectra of the loading topographies, which both may vary (similarly, because of their proportionality) with the wave vector k have been factored out. This admittance can be complex-valued since the load correlation may be, unless the power spectra of the loading topographies are isotropic. At k = 0 the admittance yields the density contrast ∆1.
Assuming that the loads are uncorrelated but proportional simplifies the Bouguer-topography admittance to the familiar expression In scenarios where only top or only bottom loading is present, we get the original expressions (Turcotte & Schubert 1982;Forsyth 1985) where, as expected and easily verified, The Bouguer-topography coherence, for correlated and proportional initial loads with constant correlation r and proportion f 2 , is which, as the admittance, is a function of wavenumber k regardless of the power spectral densities of the loading topographies. Unlike the admittance it has lost the dependence on the depth to the second interface, z2, and it is always real, 0 ≤ γ 2 • ≤ 1. When the initial loads are uncorrelated but proportional the Bouguer-topography coherence is, as according to Forsyth (1985), simply This expression was solved by Simons et al. (2003) for the wavenumber at which γ 2 In the paper by Simons et al. (2003) eq. (66) appears with a typo in the leading term, which was briefly the cause of some confusion in the literature (Kirby & Swain 2008a,b). Fig. 2 displays the individual effects that varying flexural rigidity, loading fraction and load correlation have on the expected admittance and coherence curves. Regardless of the fact that much of the literature to this date has been concerned with the estimation of the admittance and coherence from the available data, and regardless of the justifiably large amount of attention devoted to the role of windowing and tapering to render these estimates spatially selective and spectrally free from excessive leakage; regardless, in summary, of any practicality to the actual methodology by which admittance and coherence are being estimated and how the behavior of their estimates affects the behavior of the estimated parameter of interest, the flexural rigidity, D, we show these curves to gain an appreciation of the complexity of the task at hand. No matter how well we may be able to recover the "true" admittance and coherence behavior, the issue remains that they need to be interpreted -inverted -for a model that ultimately needs, or can, return an estimate for D but also of the initial-loading fraction, f 2 , and also of the correlation coefficient, r. Each of these have distinct sensitivities but overlapping effects on the predicted behavior of the measurements: selecting one end-member model (top-loading or bottom-loading only, for example, or disregarding the very possibility of load correlation, or imposing a certain non-vanishing value on the loading fraction or load-correlation coefficient) remains but one choice open to alternatives, and constraining all three is a task that, thus far, nobody has successfully attempted. Fig. 2 serves as a visual reminder of the limitations of admittance-and coherence-based estimation. However much information these statistical summaries of the gravity and topography data contain, it is not easily accessible for navigation in the three-dimensional space of D, f 2 and r.

Load correlation, proportionality and the standard model
The expressions in the previous section show how difficult it is to extract the model parameters D, f 2 and r individually from admittance or coherence. Forsyth (1985) argued that coherence depends on f 2 much more weakly than admittance, but what is important for the estimation problem is how the three parameters of interest vary together functionally: whether they occur in terms by themselves or as products, in which variations of powers, and so on. The geometry of the objective functions used to estimate the triplet of parameters, together with the distribution of any random quantities the objective functions contain, determine the properties of the estimators. We return to the question of identifiability after we have presented the new maximum-likelihood estimation method. For that matter, Forsyth (1985) suggested ignoring the load correlation, setting r = 0, and finding an estimate for the flexural rigidity D using a constant initial guess for the loading fraction f 2 and the coherence modeled as γ 2 f in eq. (65), and then using eqs (37), (39)- (40) and (54) to compute a wavenumber-dependent estimate of f 2 , which can then be plugged back into eq. (65) as a variable, and iterating this procedure to convergence. However, this allows for as many degrees of freedom as there are "data", thereby running the risk that an ill-fitting D can be reconciled with the data by adjustment with a very variable f 2 . It is unclear in this context what "ill-fitting" or "very variable" should mean, and thus it is hard to think of objective criteria to accomplish this. McKenzie (2003) showed misfit surfaces for the (free-air) admittance for varying D and varying f 2 held constant over all wavenumbers. These figures show prominent trade-offs, suggesting a profound lack of identifiability of D and f 2 with such a method. Even more importantly, McKenzie (2003) emphasized the possibility of non-zero correlations between the initial loads, deeming those prevalent in many areas of low-lying topography, on old portions of the continents: precisely where the discrepancy between estimates for elastic thickness derived from different methods has been leading to so much controversy. As an alternative to the Forsyth (1985) method, McKenzie & Fairhead (1997) suggested estimating D and f 2 from the free-air admittance in the wavenumber regime where surface topography and free-air gravity are most coherent. The rationale for this procedure is that there might be loading scenarios resulting in gravity anomalies but not (much) topography, a situation not accounted for in the Forsyth (1985) model that can, however, be described by initial-load correlation. , most recently, discussed the differences between both approaches, only to conclude that neither estimates the complete triplet (D, f 2 , r) of parameters (rigidity, proportionality, correlation) without shortcuts. Once again the statistical understanding required to evaluate whether either of these techniques results in "good" estimators is lacking.
That the cause of "internal loads without topographic expression" can indeed be attributed to correlation in the sense of (49) can be readily demonstrated by considering what it takes for the final, observable, surface topography H•1 to vanish exactly. Solving eq. (36) or eq. (44) and using eqs (23) and (25) returns the conditions that the first and second initial topographies are related to each other as which, using eqs (45), (54) and (49), implies the following equivalent relations between them: This set of equations together with our model very strongly constrain both fields. Thus, as noted by McKenzie (2003) and others after him (Crosby 2007;Wieczorek 2007;), a situation of internal loading that results in no net final topography may arise when the initial-loading topographies are perfectly correlated, balancing one another according to eqs (67)-(68). We can find a more complete condition for this scenario by equating eqs (67) and (52), which returns an expression for the orthogonal complement dH ⊥ 1 ; when this is required to vanish non-trivially we obtain the seemingly more general condition Requiring that the final surface topography have a vanishing variance S•11, substituting eqs (56)-(58) into eq. (46), we need to satisfy The correlation coefficients in eqs (69)-(70) must be real-valued since all of the other quantities involved are. Both eq. (69) and eq. (70) should be equivalent, and together they imply eq. (68). We are thus left to conclude that for the observable surface topography to vanish, the correlation between initial surface and subsurface loading must be perfect and positive, r = 1. Solving the quadratic equation (70) for f yields real-valued results only when |r| 2 − 1 ≥ 0, thus r = 1 for positive but non-constant f , as expected. The above considerations have put perhaps unusually strong constraints on the spectral forms of the final topography H•1(k) or S•11(k). From eq. (3) we learn that in doing so, the spatial-domain observables H•1(x) can never be non-zero. On the other hand, an observed H•1(x) could be zero over a restricted patch without its Fourier transform or its spectral density vanishing exactly everywhere. Alternatively, it can be very nearly zero, and this may also practically hamper approaches based on admittance or coherence which contain (estimates of) the term S•11(k) in the denominator (see eq. 48). When the observed topography becomes small, higher-order neglected terms may become prominent. Furthermore, there may be mixtures of loads with and without topographic expression (McKenzie 2003). Speaking quite generally, there will be areas with some correlation between the initial loads, and we should take this into account in the estimation. Either one of the load correlation or load fraction may vary with wavenumber. What emerges from this discussion is that the isotropic flexural rigidity D, the initial-load correlation r(k), and the initial-load proportionality f 2 (k) should all be part of the "standard model" of flexural studies. The last two concepts were introduced by Forsyth (1985), even though he did not further discuss the case of non-zero correlation.
As we wrote in the first paragraph in this section, Forsyth's first assumption was that the depth of compensation and the depth of loading in fact coincide. He writes that the assumption of collocation of these hypothetical interfaces and their precise location at depth in Earth may well be the largest contributor to uncertainty in the estimates for flexural strength, but also that there may be a priori, e.g. seismological, information to help constrain the depth z2. Thus, much like the density contrasts ∆1 and ∆2, we will not include the depth to the second interface z2 as a quantity to be estimated directly. Rather, we will consider them known inputs to our own estimation procedure and evaluate their suitability after the fact by an analysis of the likelihood functions and of the distribution of the residuals.

M A X I M U M -L I K E L I H O O D T H E O R Y
Measurements of "gravity" and "topography", which we consider free from observational noise, can be interpreted as undulations, H•1 and H•2, of the surface and one subsurface density interface, with density contrasts, ∆1 and ∆2, located at depths z1 = 0 at z2 in Earth, respectively. Geology and "tectonics" produce initial topographic loads, H1 and H2, on these previously undisturbed interfaces. These are treated as a zero-mean bivariate, stationary, random process vector, dH, fully and most generally described by a spectral matrix, S(k), under the assumption that the higher-order moments of H(x) are not too prominent (Brillinger 1975). For this paper we assume isotropy of the loading process, S = S(k). The lithosphere is modeled as a coupled set of differential equations, whose action is described by the spectral-domain matrix MD, which depends on a single, scalar parameter of interest, the flexural rigidity D. Since our observations have experienced the linear mapping dH , and the objective is to recover D, we are led to study S•(k). This includes its off-diagonal terms, which depend on the correlation coefficient of the loads at either interface, −1 ≤ r(k) ≤ 1, recall r(k) ∈ R, and, under the assumption of proportionality of the initial-loading spectra, on a loading fraction, f 2 (k). As part of the estimation we will thus also recover information about the loading process S.
All previous studies in the geophysical context of lithospheric thickness determination have first estimated admittance and coherence, ratios of certain elements of S• whose estimators have joint distributions that have not been studied. These were then used in inversion for estimates of D whose statistics have remained unknown. In the remainder of this paper we construct a maximum-likelihood estimator sensu Whittle (1953), directly from the data "gravity" and "topography", and the "known" parameters ∆1, ∆2, and z2. The unknowns are D, r and f 2 , and, as we shall see shortly, three more parameters by which we guarantee isotropy of the loading process S through a commonly utilized functional form. That this is more ambitious than the original objectives by Forsyth (1985) and the modifications by McKenzie (2003) is because the reduction of the data to admittance or coherence obliterates information that we are able to recover in some measure. We study the properties of the new estimators and derive the distributions of the residuals. When the procedure is applied to actual data, these should tell us where to adjust the assumptions used in designing the model.

Choice of spectral parameterization, σ 2 , ν, ρ
In the above we have seen that the primary descriptor of what causes the observed behavior is the spectral matrix S(k) from which the initial interface-loading topographies are being generated. After the assumption of spectral proportionality of the loading at the two interfaces, the expressions for admittance and coherence no longer contain any information about this particular quantity, though of course the deviations of the observed admittance and coherence from the models discussed in Section 3.4 still might. However, this information is no longer in an easily accessible form. Furthermore, coherence and admittance are typically estimated non-parametrically: the infinitely many, or rather, 2K = M ×N dimensions of the data are reduced to a small number of wavenumbers at which they are being estimated, thus there is a loss of O(K) degrees of freedom. At the low frequencies, most tapering methods experience a further reduction in resolution, which is detrimental especially in estimating the value of thick lithospheres from relatively small data grids, as is well appreciated in the geophysical literature.
Here, we will simply parameterize the initial loading using a "red" model, thereby avoiding such a loss. We may consult Goff & Jordan (1988, Carpentier & Roy-Chowdhury (2007) or Gneiting et al. (2010) for such models. Here we do, however, make the very strong assumption of isotropy. This is unlikely to be satisfied in real-world situations, as spectral-domain anisotropy is part and parcel of all geological processes (Goff et al. 1991;Goff & Arbic 2010). Relaxing the isotropic loading assumption introduces considerable extra complications. Our reluctance to handle anisotropic loading situations stems from the fact that their estimation might be confused statistically with a possible anisotropy in the lithospheric response: we can thus not easily study one without studying the other.
At this point we collect the parameters that we wish to estimate into a vector. To begin with, the "lithospheric" parameters, flexural rigidity D, loading ratio f 2 and load correlation r are We denote a generic element of this vector as θL. For the spectrum of the initial-loading topographies we choose the isotropic Matérn spectral class, which has legitimacy in geophysical circles (Goff & Jordan 1988;Stein 1999;Guttorp & Gneiting 2006). We specify whose parameters we collect in the set with generic element θS. The third parameter, ρ, is distinct from the mass density, as will be clear from the context. The full set of parameters that we wish to estimate problem is contained in the vector whose general element we denote by θ. For future reference we define the parameter vector that omits all consideration of the correlation as Fig. 3 shows a number of realizations of isotropic Matérn processes with different spectral parameters. As can be seen the parameters σ 2 ("variance") and ρ ("range") impart an overall sense of scale to the distribution while ν ("differentiability") affects its shape (Stein 1999;Paciorek 2007).

The observation vectors, dH and H
In Section 2 we introduced the standard statistical point of view on stationary processes (Brillinger 1975;Percival & Walden 1993). We specified how this applies to a finite set of geophysical observations that can be defined in a two-layer system, which we revealed to be the various types of "topography" and "gravity", and which are mapped into one another by the differential equations describing "flexure". Subsequently, we introduced the matrix formalism that describes the connections between the various geophysical observables and the initial driving forces that produce them, which we used extensively in Section 3 to discuss the standard approach of determining the unknown parameters of the flexural differential equation and the relative importance and correlation of the loading processes acting across either layer interface, which are of geophysical interest (e.g. Forsyth 1985;McKenzie 2003). To address the problem of how to properly estimate these unknowns and their distribution, we now return to the statistical formalism espoused in Section 2.1 in order to clarify how the "theorized" geophysical observables, i.e. the spectral processes describing the various kinds of topography dH(k) and gravity anomalies dG(k) are being shaped into the "actual" observations. Those are the windowed Fourier transforms H(k) and G(k) of particular realizations of topography and gravity as we can calculate from finite spatial data sets H(x) and G(x) measured in nature. In the spectral domain we continue to distinguish by the choice of font the theory (calligraphic) from what we can actually calculate (italicized). In the spatial domain, there is no need to define anything but H(x) or G(x).

In theory: infinite length and continuous
We recall that the spectral matrix S•(k), given by eq. (56), of the vector of final, observable, topographies dH•(k) defined in eqs (43)-(47), is separable in the sought-after parameter vectors θS and θL by the factoring of the spectral density S11(k) of the initial-loading topographies, In writing eq. (76) we emphasize the wavenumber-only dependence of the "spectral" matrix S11(k), which is isotropic, but keep the full wavevector dependence of the "lithospheric" matrices T(k) and ∆T(k) to make sure they have the same dimensions as the data. However, in the case of isotropic loading both T(k) and ∆T(k) will also only depend on wavenumber, and they will both be real. We thus rewrite eqs (57)-(58) with the dependencies φ(k), ξ(k), r(k) and f 2 (k) implicit in this sense, The Cholesky decomposition reverts to the Cholesky decomposition of T(k) when r = 0. Explicit expressions appear in Appendix 9.1. Because of the above relationships the transformed quantities  The initial loads were generated from the Matérn spectral class with parameters σ 2 , ρ and ν; they were negatively correlated, r = −0.75, and the spectral proportionality was f 2 , as indicated in the legend. Also shown, by the black line, is the Bouguer gravity anomaly, G •2 . The density contrasts used were ∆ 1 = 2670 kg m −3 and ∆ 2 = 630 kg m −3 , respectively. All symbols are listed and explained in Table 1.
have a spectral matrix that is the 2×2 identity,

In actuality: finite length and discretely sampled
We now define the vector of Fourier-transformed observations, derived from the actual measurements in eq. (5) and in (13), through eq. (35), With WK(k) the Fourier transform of the applied window defined in eq. (6), and by comparison with eqs (9)-(13), the covariance In comparison to eq. (46) and eqs (56) or (76), the finite observation window introduces spectral blurring, the loss of separability of the spectral and lithospheric portions, and small correlations between wave vectors. These we ignored when writing the last, approximate equality, introducing the blurred quantity (for a specific window wK , as opposed to eqs 10-11 where we first used the overbar notation) We denote the Cholesky decomposition ofS• as such that the transformed variable has unit variance

In simulations: how to go from the continuous to the discrete formulation
Correctly generating a data set H• that is a realization from a theoretical spectral process dH• with the prescribed spectral density S• requires ensuring that when we observe a finite sample of it, and we form the (tapered) periodogram of this, we get the correctly blurred spectral density (Percival 1992;Chan & Wood 1999;Dietrich & Newsam 1993Thomson 2001;Gneiting et al. 2006) in our case eq. (84). Stability considerations require that should we simulate data on one discrete grid and then extract a portion on another discrete grid, we replicate the correct covariance structure everywhere in space and always produce the correct blurring upon analysis. Failure to acknowledge the grid properly at the simulation stage can lead to severely compromised results as will be readily experienced but has not always been consciously acknowledged in the (geophysical) literature (Peitgen & Saupe 1988;Robin et al. 1993). The method that we outline here is variously known as Davies & Harte (1987) or circulant embedding (Wood & Chan 1994;Craigmile 2003). Let us assume that we have a spatial grid x as in eq. (4), and a half-plane Fourier grid k as in eq. (14). On the K entries of the latter we generate (complex proper) Gaussian variables Z•(k) and then transform these as suggested by eqs (86)-(87), wherebyL• is the Cholesky decomposition expressed on the grid k, of eq. (84) calculated on a much finer grid k ′ . In other words, whereby |F (k)| 2 is the unmodified periodogram of the spatial boxcar function that defines the simulation grid. The convolution in eq. (89) is to be implemented numerically, with care taken to preserve the positive-definiteness of the result. We now define the discrete inverse Fourier transform of this particular set of variables for this fixed set of wave vectors k to be equal to the integral that we introduced in eq. (3), which holds, in fact, for any x ∈ R 2 , and is consistent with eq. (5) which holds for the area of interest picked out by the boxcar window. We generate synthetic data sets H•(x) via eqs (88)-(90): by this procedure the covariance between any two points x and x ′ in any portion of space identified as our region of interest is now determined to be which follows from eqs (90), (46) and (83) with the small correlations between wave vectors neglected, and using the notation introduced in eq. (51). Now eq. (91) is equal to the universal expression in eq. (8), consistent with eqs (10)-(12), and since the dependence is only on the separation x − x ′ , stationarity is guaranteed. With x = x ′ eq. (91) states Parseval's theorem: at every point in space the variance of H• is equal to all of its spectral energy. Of course in the isotropic case considered here, C0(x − x ′ ) = C0( x − x ′ ), depending only on distance. Should we now take the finite windowed Fourier transform of such synthetically generated spatial data H•(x) on a different spatial patch (e.g. a subportion from the master set), while using any arbitrary window or taper w K ′ (x), we will be seeing the correctly blurred version of the theoretical spectral density S•, as required to ensure stability. Indeed, when forming a new set of modified Fourier coefficients H ′ • (k), distinguished by a prime, their covariance now must be, as follows directly from eqs (92), (91) and (6), the blurred quantity which is exactly as we have wanted it to be consistent with eq. (83). We will continue to neglect the small correlations between wave vectors, but fortunately this will have limited impact (Varin 2008;Varin et al. 2011). Fig. 4 shows a realization of a simulation produced with the method just described. In contrast to Fig. 1 we now show the result of the case where the initial-loading topographies are indeed (negatively) correlated. Evidence for the loading correlation is not apparent to the naked eye.

The log-likelihood function, L
Conditioned upon higher-order moments of the space-domain data being finite (Brillinger 1975), their Fourier components are near-Gaussian distributed, and for stationary processes, there are no correlations between the real and imaginary parts of the Fourier transform, which are independent. Writing N for the Gaussian and N C for the proper complex Gaussian distributions (Miller 1969;Neeser & Massey 1993), and dropping more wave vector dependencies as arguments than before, the observation vectors H•(k) in eq. (82) and the rescaled Z•(k) of eq. (86) are thus characterized at each wave vector k by the probability density functions pH • = N C 0,S• , pZ • = N C (0, I) , Re{Z•} ∼ N 0, 1 2 I , and Im{Z•} ∼ N 0, 1 2 I .
As we have noted at the end of Section 2.1, at the Nyquist and zero wave numbers these quantities are real with unit variance. In so writing the observation vector is treated as a random variable, but we are interested in the likelihood of observing the particular data set at hand given the model, which for us means an evaluation at the data in function of the deterministic parameters σ 2 , ρ, ν, D, f 2 , r. This quantity,L(θ), receives contributions from each wave vector k that, once the number K of considered wave vectors is large enough, can be considered independent from one another (Dzhamparidze & Yaglom 1983). The log-likelihood is thus, up to a constant, given by the standard form While we know that there is in fact correlation between the termsL k (θ), only at very small sample sizes K will this produce inefficient estimators, as the accrued effects of the correlation diminish in importance with increasing sample sizes. At moderate to large sample sizes there is considerable gain in computational efficiency and no loss of statistical efficiency due to the fast spectral decay of the blurring kernel functions involved. Our objective function, the log-likelihood, remains simply the average of the contributions at each wave vector in the half plane. Of course eqs (96)-(97) contain the blurred spectral formsS•(k) that we defined in eq. (84), in acknowledgment of the fact that the variance experiences the influence from nearby wave vectors: the approximation made asymptotically is that of eq. (83), but eq. (84) is exact. While we cannot ignore this blurring for finite sample size and for the particular data tapers used to obtain the windowed Fourier transforms, for very large data sets and well-designed, fast-decaying, window functions (e.g. Simons & Wang 2011) the observation vectors H• will converge 'in law' (Ferguson 1996) to random variables H ′ • that are distributed as complex proper Gaussian with an unblurred variance, in which case we would simply write Working with this distribution is mathematically more convenient since all of the subsequent calculations can be done analytically, and, per eq. (76), separably in the lithospheric and spectral parameters, so we will adhere to it until further notice. In this case the log-likelihood is While algorithms for simulation and data analysis will be based on eq. (97), we will use eq. (100) to study the properties of the solution, ultimately (in Section 6 and Appendix 9.8) demonstrating why such an approach is justified. On par with eq. (100) we introduce an equivalent likelihood in whose formulation the correlation coefficient r does not appear, with the notation of eqs (74)-(75) and eqs (76)-(78), namelỹ

The maximum-likelihood estimator,θ
The gradient of the log-likelihood, the score function, is the vector with generic elements, never to be confused with the coherence functions (64)-(65), that we shall denote as Following standard theory (Pawitan 2001;Davison 2003) we define the maximum-likelihood estimate as that which maximizes L(θ), thuŝ θ is the vector of the maximum-likelihood estimate of the parameters, for which Contingent upon the requisite second order conditions being satisfied (Severini 2001), this is also assumed to be the global maximum of (100) in the range of parameters that θ is allowed to take. We now let θ0 be the vector containing the true, unknown values, and have a certain θ ′ lie somewhere inside a ball of radius θ − θ0 around it. Then we may expand the score with a multivariate Taylor series expansion, using the Lagrange form of the remainder, to arrive at the exact expression The random matrix F is the Hessian of the log-likelihood function, with elements defined by and an expected value −F , the Fisher 'information matrix', Hence the name 'observed Fisher matrix' which is sometimes used for the Hessian. If it is invertible we may rearrange eq. (105) and writê For this exponential family of distributions the random Hessian converges 'in probability' to the constant Fisher matrix This is more than a statement about means: the fluctuations of F about its expected value also become smaller and smaller. Thus, no matter where we evaluate the Hessian, at θ ′ or at θ0, both tend to the constant matrix F . The distributional properties of the maximumlikelihood estimatorθ can be deduced from eqs (108)-(109), which are also the basis for Newton-Raphson iterative numerical schemes (e.g. Dahlen & Simons 2008). We thus need to study the behavior of γ, F, and F . The symbols of the statistical apparatus that we have assembled so far are listed in Table 2.

The score function, γ
Per eqs (102)-(104) the derivatives of the log-likelihood function L vanish at the maximum-likelihood estimateθ. With our representation of the unknowns of our problem by the parameter sets θL and θS we are in the position to calculate the elements of the score function γ explicitly. We remind the reader that these are not for use in the optimization using real data sets where the blurred likelihoodL is to be maximized instead. In that case the scores ofL will need to be calculated numerically. However, the scores of the unblurred likelihood L that we present here will prove to be useful in the calculation of the variance of the maximum-blurred-likelihood estimator. Combining eqs (100) through (103) we see that the general form of the elements of the score function will be given by For the lithospheric and spectral parameters, respectively, we will have The explicit expressions can be found in Appendices 9.2-9.3. For completeness we note here that ∂S −1 11 /∂θS = −m θ S S −1 11 . To determine the sampling properties of the maximum-likelihood estimation procedure we use eqs (99)-(103) to make the identifications to obtain the standard result that the expectation of the score over multiple hypothetical realizations of the observation vector vanishes, as In the treatment that is to follow (Johnson & Kotz 1973), we will need to perform operations on multiple similar forms as in eq. (110), namely To facilitate the development for the second term in eq. (115) we use eq. (88), but again without the complications of spectral blurring, see eq. (80), and proceed by eigenvalue decomposition of the symmetric matrices L T where λ + θ (k) and λ − θ (k) are the two possibly degenerate eigenvalues of L T • A θ L• constructed by combining eqs (79) and (111)- (112), Since the matrix P θ is orthonormal, Z θ andZ θ are identically distributed and thus we find through eq. (96) that eq. (117) is a weighted sum of independent random variables, each exponentially distributed, χ 2 2 /2, with unit mean and variance. In summary, we have the convenient form for the contributions to the score (110) from each individual wave vector, Since m θ is nonrandom we thus have an expectation for the contributions to the score that confirms eq. (114), namely and a variance given by We also retain the useful expression Eq. (121) gave us the variance of the derivatives of the log-likelihood function with respect to the parameters of interest, which was written in terms of the eigenvalues of the non-random matrix L T • A θ L•. More specifically, for the variances of the scores in the lithospheric parameters θL in θL = [D f 2 r] T , we will find whereas for the variances of the scores in any of the three spectral parameters θS in θS = [σ 2 ν ρ ] T , judging from eq. (112), we will need the sum of the squared eigenvalues of −m θ S L T As to the covariance of the scores in the different parameters we use eqs (113)-(114) to write and thereby manage to equate the variance of the score to the expectation of the negative of its derivative, which should of course specialize to verify eq. (121), giving us two calculation methods for the variance terms. We do not consider any covariance between the scores at non-equal wave vectors. From eqs (110) and (119) we have learned that the full score γ θ is a sum of random variables γ θ (k) or indeed the |Z ± θ (k)| 2 , which belong to the exponential family. Between those we consider no correlations at different wave vectors, and eqs (120) and (126) have given us their mean and covariance, respectively. Lindeberg-Feller central limit theorems apply (Feller 1968), and so the distribution of the score γ θ will be Gaussian with mean zero and covariance Using eqs (126), (100) and (106)- (107) we can rewrite the above expression in terms of the diagonal elements of the Fisher matrix, We can summarize all of the above by stating that, for K sufficiently large, ignoring wave vector correlations, and through the Lindeberg-Feller central limit theorem, the vector with the scores in the individual parameters converges in law to what is distributed as

The Fisher information matrix, F
From the definition in eq. (107) we have that the elements of the Fisher matrix F are given by the negative expectation of the elements of the Hessian matrix F, which themselves are the second derivatives of the log-likelihood function L with respect to the parameters of interest θ. Per eq. (128) the Fisher matrix scales to the covariance of the score γ, and by combining eqs (123)-(124) with eq. (110) or, ultimately, eqs (121) and (127), we thus find a convenient expression for the diagonal elements of the Fisher matrix, namely which, for the spectral parameters specializes to the more easily calculated expression For the cross terms, rather than combining eqs (119) and (127), we proceed via eq. (128) and thus require expressions for the elements of the Hessian. From eqs (106) and (110) we derive that the general expression for the elements of the symmetric Hessian matrix are description eq. θ 0 the true, unknown, parameter set of the problem, consisting of lithospheric and spectral parameters (105) θ the maximum-likelihood estimate of the parameter set (104) θ, θ ′ generic occurrences of the parameter set (74) θ L the lithospheric parameter set of the estimation procedure, containing D, f 2 and r listed in Table 1 (71) θ S the spectral parameter set of the estimation procedure, containing σ 2 , ρ and ν listed below (73) θ the parameter set not including the correlation coefficient r S 11 the power spectral density of the top-loading process, here assumed to be isotropic (72) σ 2 the first quantity in the parameterized Matérn form of the spectral density S 11 , to be estimated (72) ρ the second quantity in the parameterized Matérn form of the spectral density S 11 , to be estimated (72) ν the third quantity in the parameterized Matérn form of the spectral density S 11 , to be estimated (  Unless we use it in the numerical optimization of the log-likelihood we only need the negative expectation of eq. (132), the Fisher matrix where we have used eq. (122). Of course, when θ = θ ′ , the general eq. (133) specializes to the special case (130) discussed before. Ultimately this equivalence is a consequence of eq. (126) which held that in expectation, the product of first derivatives of the log-likelihood is equal to its second derivative.
The explicit forms are listed in Appendix 9.4, but looking ahead, we will point to two special cases that result in simplified expressions. It should be clear from the separation of lithospheric and spectral parameters achieved in eq. (76) and from eqs (111)-(112) that the mixed derivatives of one lithospheric and one spectral parameter, ∂ θ L m θ S = ∂ θ S m θ L = 0 and ∂ θ L S11 = 0, both vanish, and that we thereby have Finally, we also easily deduce that where we have used the previously noted special case of eq. (122) by which The previously encountered eq. (131) is again a special case of eq. (135) when θS = θ ′ S . Both expressions (134) and (135) are of an appealing symmetry. Between them they cover the majority of the elements of the Fisher matrix, which will thus be relatively easy to compute.

Properties of the maximum-likelihood estimate,θ
We are now ready to derive the properties of the maximum-likelihood estimate given in eq. (108), which we repeat here, aŝ From eq. (129) we know that the score γ converges to a multivariate Gaussian, and from eq. (109) we know that the Hessian F converges in probability to the Fisher matrix F . A Taylor expansion allows us to replace θ ′ by θ0 as in standard statistical practice (Cox & Hinkley 1974). Thus, by Slutsky's lemma (Severini 2001;Davison 2003) the distribution ofθ is also a multivariate Gaussian. Its expectation will be showing how our maximum-likelihood estimator is unbiased. Its covariance is From eq. (128) we retain that Kcov{γ(θ0)} = F (θ0) and with F = F T a symmetric matrix, we conclude that the covariance of the maximum-likelihood estimator is given by In summary, we have shown that which allows us to construct confidence intervals on the parameter vector θ. Denoting the generic diagonal element of the inverse of the Fisher matrix evaluated at the truth θ0 as J θθ (θ0), this equation shows us that each element of the parameter vector is distributed as As customary, we shall replace the needed values θ0 with the estimatesθ and quote the 100×α % confidence interval on θ0 as given bŷ where zα is the value at which the standard normal reaches a cumulative probability of 1 − α, i.e. z α/2 ≈ 1.96 for a 95% confidence interval. These conclusions, which are exact for the case under consideration, will hold asymptotically when in practice we use the blurred likelihood (97) instead of eq. (100). In the blurred case and for all numerical optimization procedures, we expect to have to amend eqs (137) and (138) by correction factors on the order of K −1 and K −2 , respectively. Eq. (142) would receive extra correction terms starting with the order K −1 , which would be immaterial given the size of the confidence interval.
In some sense, eq. (142) concludes the analysis of our maximum-likelihood solution to the problem of flexural-rigidity estimation. It makes the important statement that each of the estimates of flexural rigidity D, initial-loading ratio f 2 , and load correlation coefficient r, will be normally distributed variables centered on the true values and with a standard deviation which will scale with the inverse square-root of the physical data size K. Obtaining the variance on the estimates of effective elastic thickness Te from the estimates of D will be made through eq. (21) via the "delta method" (Davison 2003). This implies that the estimate of the effective elastic thickness is approximately distributed as

Analysis of residuals
Once the estimateθ = [θLθS] T has been found, we may combine it with our observations, and through eq. (86), form the variablê which should be distributed as the standard complex proper Gaussian N C (0, I). Equivalently, and as a special case of eqs (97) and (117), and these variables should be approximately independent. We can rank order them according to their size, The examples are for a case where K = 2 × 32 × 32, ∆ 1 = 2670 kg m −3 , ∆ 2 = 630 kg m −3 , z 2 = 35 km, and the sampling intervals were 20 km in each direction. The true model is for the correlated case where the lithospheric parameters are D = 1 × 10 24 N m, f 2 = 0.8 and r = 0.75, and the spectral parameters σ 2 = 2.5 × 10 −3 , ν = 2, ρ = 4 × 10 4 . (Top row) A "bad" example where the residuals do not follow the predicted distribution and continue to show too much structure in the wave vector domain. For this example, the poor estimate is given byD = 1.5 × 10 24 N m,f 2 = 0.915 andr = 0.656,σ 2 = 1.9 × 10 −3 ,ν = 1.5,ρ = 4.25 × 10 4 , and the blurred log-likelihoodL = −18.591. (Bottom row) A "good" example which indicates that the estimate will be accepted as a fair representation of the truth, which in this case isD = 1.326 × 10 24 N m,f 2 = 0.790 andr = 0.741, σ 2 = 2.415 × 10 −3 ,ν = 2.00,ρ = 3.974 × 10 4 . The blurred log-likelihoodL = −18.2883. No structure is detected in the residuals: the model fits. and inspect the quantile-quantile plot (Davison 2003) whereby the X (j) 0 , for all j = 1, . . . , K, are plotted versus the inverse cumulative density function of the χ 2 4 /2 distribution, evaluated at the argument j/(K + 1). If, apart from at very low and very high values of j, this graph follows a one-to-one line, there will be no reason to assume that our model is bad for the data. This can then further be formalized by a chi-squared test (Davison 2003), but a plot of the residuals as a function of wave vector will be more informative to determine how the model is misfitting the data. In particular it may diagnose anisotropy of some form, or identify particular regions of spectral space that poorly conform to the model and for which the latter may need to be revised. Fig. 5 illustrates this procedure on a recovery simulation under correlated loading.
If the method holds up to scrutiny of this type, then because ours is a maximum-likelihood estimator, it will be asymptotically efficient, with a mean-squared error that will be as small or smaller than that of all other possible estimators, converging to the optimal estimate as the sample size grows to infinity.

Admittance and coherence return, briefly
The theoretical admittance Q• and coherence γ 2 • are nothing but one-to-one functions of our parameters of interest. Consequently (Davison 2003), maximum-likelihood estimates for either Q• or γ 2 • are obtained simply by evaluating the functions (59) or (64) at the maximumlikelihood estimate of the parameters. The equivalence is easy to appreciate by expanding the score in the desired function, e.g. γ 2 • , as a total derivative involving the parameters D, f 2 and r, The score in γ 2 • vanishes when ∂L/∂D = ∂L/∂f 2 = ∂L/∂r = 0 as long as each of ∂γ 2 • /∂D, ∂γ 2 • /∂f 2 and ∂γ 2 • /∂r are non-zero. Thus the maximum-likelihood estimates Q• and γ 2 • are obtained at the maximum-likelihood valuesD, f 2 andr, and are computed without difficulty, as we will illustrate shortly. See Appendix 9.5 for a few additional considerations.

T E S T I N G T H E M O D E L
In the previous section we discussed the question whether the "model" to which we have subscribed is at all "valid" in very general terms. Here, we will address two possible concerns more specifically. The main ingredients of our model are the flexural equations (20), correlation (49) and proportionality (54) of the initial topographies, and the isotropic spectral form (72) that we assumed for the loading terms. Other than that, we have introduced a certain fixed two-layer density structure ∆1, ∆2 and z2, and an approximate way of computing gravity anomalies by way of eq. (26). When working within this framework, we showed in Section 4.8 how to assess the quality of the data fit, and in Section 4.9 how to hindcast the traditional observables of admittance and coherence. However, what we have not addressed is the relative merits of alternative models. How appropriate is the Matérn class, especially in its isotropic form? How different would an analysis that does not consider correlated loading be from one that does? What would be the effect of modifying or adding additional terms to the flexural equations, as could be appropriate to consider more complex tectonic scenarios, elastic non-linearities, elastic anisotropy, or alternative rheologies (as, for example, Stephenson & Beaumont 1980;Stephenson & Lambeck 1985;Ribe 1982;Swain & Kirby 2003a;McKenzie 2010)? We cannot, of course, address all of these questions with any hope for completeness, but in this section we introduce two specific considerations that will speak to these issues.
The first, detailed in Appendix 9.6, involves a stand-alone methodology to recover the spectral parameters in the Matérn form given univariate multi-dimensional data. This will help us build well-suited data synthetics; it will also enable the study of terrestrial and planetary surfaces per se, e.g. to measure the roughness of the ocean floor or the lunar surface (e.g. Goff & Arbic 2010;Rosenburg et al. 2011). Even more broadly, it is an approach to characterize texture (Haralick 1979;Cohen et al. 1991) in the context of geology and geophysics. Although our chosen parameterization (72) permits a wide variety of spectral shapes, we are of course limiting ourselves by only considering isotropic loading models. In future work, anisotropic spectral shapes for the loading terms will be considered.
The second, in Appendix 9.7, is a worked example of how, specifically, the inclusion or omission of the initial-loading correlation coefficient, r, may influence the confidence that we should have in our maximum-likelihood estimates obtained with or without it. We might construct a likelihood L(θ), as in eq. (100) with all terms (76)-(78) present, or instead we might force the initial-loading correlation to r = 0. This would result in a simpler form that we have calledL(θ) in eq. (101), whereby the parameter r is lacking altogether from the vector to be compared with the expression for θ in eq. (74). Sinceθ ⊂ θ, both models are 'nested': the less complicated model can be obtained by imposing constraints on the more complicated model, so that the simpler model is a special case of the more complicated one. In that case the likelihood-ratio test (Cox & Hinkley 1974;Severini 2001) that we describe in Appendix 9.7 is applicable. It is inappropriate to compare models using likelihood ratios if they are not nested, even if special exceptions exist to that rule (see, e.g., Vuong 1989;Fan et al. 2001). What we have not done is incorporate the effect of downward continuation in eq. (35) into the analysis. The 'data' that we will generate and analyze in our synthetic experiments will have been 'perfectly' downward continued to the single 'appropriate' interface at depth, from 'noise-free' gravity observations, which remains a very idealized situation. Some problems anticipated with numerical stability might be remediated through dedicated robust deconvolution methods, but more generally, giving up this level of idealization for real-world data analysis will cause complications that require special treatments. Absent these, our theoretical error estimates will be minimum bounds. Keeping in mind that the complications of this kind are shared by other gravity-based methods, we feel justified in not exhaustively discussing all of our options here. Nevertheless, we can look ahead at addressing the downward continuation of the gravity field within the framework of our maximum-likelihood method by considering what would happen if we took the surface topography and the gravity anomaly as the primary observables, rather than the surface and (deconvolved) subsurface topography as we now have, in eq. (43). We would, essentially, continue to carry the factors χ(k) from eq. (35) throughout the development. In the application of the blurred data analysis (89) those factors would appear inside the convolutional integrals, to appear in Appendix 9.8, of the kind (236), and their appearance there would no doubt regularize the gravity deconvolution by stabilizing the inverse (237) and its derivatives (238) as actually used by the optimization algorithm. However, the variance expressions for the maximum-likelihood estimates, which we derive based on the unblurred likelihoods, would presumably be farther from their blurred equivalents once the deconvolution is also part of the estimation in this way, and it would require much detailed work to arrive at a complete understanding of such a procedure. At the end of the day, we would still not have remediated the geophysical problems of measurement and data-reduction noise in obtaining the Bouguer gravity anomalies, nor handled possible departures from the two-layer model that may exist in the form of internal density anomalies. The list of caveats is long but again shared among other gravitybased methods, over which the maximum-likelihood method has a clear advantage, as we have seen, theoretically, above, and are about to show, via simulation, in what follows.

N U M E R I C A L E X P E R I M E N T S
Numerical experiments are straightforward. We generate synthetic data using the procedure established in Sections 4.2.1-4.2.3, and then employ an iteration scheme along the lines of eqs (108)-(109): starting from an initial guess we proceed through the iterations k = 0, . . . aŝ until convergence. In practice any other numerical scheme, e.g. by conjugate gradients, can be used, the only objective being to maximize (or minimize the negative) log-likelihood (97) by whichever iteration path that is expedient, and for which canned routines are readily available.
The important points to note are, first, that we do need to implement the convolutional blurring step (89) in the generation of the data, so as to reference them to a particular generation grid while keeping the flexibility to subsample, section, and taper them for analysis as in the real-world case. Second, we do need to maximize the blurred log-likelihood (97) and not its unblurred relatives (100) or (101). The datageneration grid and the data-inversion grid may be different. If these two stipulations are not met, an "inverse crime" (Kaipio & Somersalo 2005Hansen 2010) will be committed, leading to either unwarranted optimism, or worse, spectacular failure -both cases unfortunately paramount in the literature and easily reproduced experimentally.
From the luxury of being able to do synthetic experiments we can verify, as we have, the important relations derived in this paper, e.g., the expectation of the Hessian matrices of eq. (107), the distribution of the scores in eq. (128), of the residuals in eq. (145), of the likelihood ratios in eq. (235) of the forthcoming Appendix 9.8, and of course virtually all of the analytical expressions listed in the Appendices. We can furthermore directly inspect the morphology of the likelihood surface (97) for individual experiments and witness the scaled reduction of the confidence intervals with data size predicted by eq. (142). Via eq. (147) we can compare coherence (and admittance) curves with those derived from perfect knowledge, and contrast them with what we might hope to recover from the traditional estimates of the admittance and coherence. We do stress again that even if we did have perfect estimates of admittance and coherence, the problem of estimating the parameters of interest from those would be fraught with all of the problems, encountered in the literature, that led us to undertake our study in the first place.
Most importantly, we can check how well our theoretical distributions match the outcome of our experiments. After all, in the real world we will only have access to one data set per geographic area of interest, and will need to decide on the basis of one maximum-likelihood estimate which confidence intervals to place on the solution, and which trade-offs and correlations between the estimated parameters to expect. We were able to derive the theoretical distributions only by neglecting the finite-sample size effects, basing our expressions on the 'unblurred' likelihood of eq. (100) when using eq. (97) would have been appropriate but analytically intractable. In short, we can see how well we will do under realistic scenarios, and check how much we are likely to gain by employing our approach in future studies of terrestrial and planetary inversions for the effective elastic thickness, initial-loading fraction and load-correlation coefficient.
Figs 1 and 3-5 were themselves outputs of genuine simulations to which the reader can refer again for visual guidance. Here we limit ourselves to studying the statistics of the results on synthetic tests with simulated data. In Figs 6-9 we report on two suites of simulations: one under the uncorrelated-loading scenario for two different data sizes in Figs 6-7, and one under correlated loading for two different data sizes in Figs 8-9. Histograms of the outcomes of our experiments are presented in the form of diffusion-based non-parametric 'kernel-density estimates' (Botev et al. 2010), which explains their smooth appearance. The distributions of the estimators are furthermore presented in the form of the quantile-quantile plots as introduced in eq. (146), which allows us to identify outlying regions of non-Gaussianity. Figures of the type of Fig. 5 should help identify problems with individual cases.
For the uncorrelated-loading experiments shown in Figs 6-7 there are few meaningful departures between theory and experiment. The predicted distributions match the observed distributions very well, and the parameters of interest can be recovered with great precision. Indeed, Fig. 6 shows us that an elastic thickness Te = 43.2 km on a 1260×1260 km 2 grid can be recovered with a standard deviation of 2.9 km, with similarly low relative standard deviations for the other parameters. Fig. 7, whose data grid is twice the size in each dimension, yields standard deviations on the estimated parameters that are half as big, in accordance with eq. (142). What is remarkable is that both theory and experiment, shown in Fig. 10, predict that the flexural rigidity D and the initial-loading ratio f 2 can be recovered without appreciable correlation between them, and with little trade-off between them and the spectral parameters σ 2 , ν and ρ, even though the trade-off between the spectral parameters themselves is significant. This propitious "separable" behavior is not at all what the entanglement of the parameters through the admittance and coherence curves shown in Fig. 2 would have led us to believe, and it runs indeed contrary to the experience with actual data as reported in the literature. The likelihood contains enough information on each of the parameters of interest to make this happen; the very act of reducing this information to admittance and coherence curves virtually erases this advantage by the collapse of their sensitivities.
For the correlated-loading experiments shown in Figs 8-9 the agreement between theory and experiment is equally satisfactory. The introduction of the load-correlation coefficient r contributes to making the maximum-likelihood optimization 'harder'. In our example we are nevertheless able to estimate an elastic thickness Te = 17.8 km on a 1260×1260 km 2 grid with a standard deviation of only 0.7 km, as shown in Fig. 9. In contrast, Fig. 8, whose data grid is half the size in each dimension, yields standard deviations on the estimated parameters that are about twice as big, in accordance with eq. (142). Fig. 11 shows the normalized covariance of the estimators.
In all of our experiments as reported here we implemented the finite-sample size blurring in the data analysis, but made predictions based on the unblurred likelihoods, as discussed before. The figures discussed in this section serve as the ultimate justification for the validity of this approach, with further heuristic details deferred to Appendix 9.8. When omitting the blurring altogether the agreement between theory and practice becomes virtually perfect. As we have argued, though, in those cases we commit the inverse crime of analyzing the data on the same grid on which they have been generated, which is unrealistic and needs to be avoided. We also note that in designing practical inversion algorithms, care should be taken in formulating an appropriate stopping criterion. The exactness of the computations should match the scaling of the variances with the data size, which we showed goes as 1/K in eq. (128). This is difficult to tune, and some synthetic experiments might inadvertently trim or 'winsorize' the observed distributions by setting too stringent a convergence criterion.
Figs 12 and 13, to conclude, show the distribution of estimates of the admittance and coherence for the entire set of experiments about which we have reported here. The maximum-likelihood estimates agree very well with the theoretical curves, although the effect of varying data size on the spread is understandably noticeable. Our initial misgivings about the traditional admittance and coherence estimates (obtained by Fourier transformation and averaging over radial wavenumber annuli) are well summed up by their behavior, which shows significant bias and large variance. While the bias can be taken into account in comparing measurements with theoretical curves, as it has been by various authors (Simons et al. 2000;Pérez-Gussinyé et al. 2004 geophysical parameters, will be deprived of the many benefits that a direct maximum-likelihood inversion brings and that we have attempted to illustrate in these pages.  Fig. 6 but now carried out on a 128×128 grid. This roughly halves the standard deviation of the estimates, implying a theoretical recovery of Te = 43.2 ± 1.4 km, f 2 = 0.8 ± 0.013, σ 2 = (2.5 ± 0.1) × 10 −3 , ν = 2 ± 0.029, ρ = (2 ± 0.0273) × 10 4 . As in Fig. 6, the experiments fit the theory encouragingly well.  Fig. 8 but now carried out on a 64×64 grid. This roughly halves the standard deviation of the estimates, implying a theoretical recovery of Te = 17.8 ± 0.7 km, f 2 = 0.4 ± 0.008, r = −0.75 ± 0.007, σ 2 = (2.5 ± 0.2) × 10 −3 , ν = 2 ± 0.061, ρ = (2 ± 0.0672) × 10 4 . As in Fig. 8, the experiments fit the theory very well. There is some trade-off between the lithospheric (D, f 2 ) and the spectral parameters (σ 2 , ν, ρ), but virtually none among the lithospheric parameters, while the spectral parameters cannot be independently resolved.  Fig. 10. The match between theory and experiment is on par with that found in the uncorrelated case. The covariances between parameters are similar in both cases, with a significant trade-off between the lithospheric parameter D and two of the spectral parameters, σ 2 and ρ, which, themselves, cannot be independently resolved. Underneath we show medians (black circles) and 2.5th and 97.5th percentile ranges (black bars) of two hundred "traditional" unwindowed Fourier-based estimates of admittance and coherence, highlighting the significant bias and/or high variance of such procedures. . The layout is as in Fig. 12. The "traditional" admittance and coherence estimates (black circles, medians, and 2.5th to 97.5th percentiles) once again show significant bias and/or variance, although the admittance can be estimated much more accurately than the coherence using conventional Fourier methods.

C O N C L U S I O N S
In this paper we have not answered the geophysical question "What is the flexural strength of the lithosphere?" but rather the underlying statistical question "How can an efficient estimator for the flexural strength of the lithosphere be constructed from geophysical observations?". Our answer was constructive: we derived the properties of such an estimator and then showed how it can be found, by a computational implementation of theoretical results that also yielded analytical forms for the variance of such an estimate. We have stayed as close as possible to the problem formulation as laid out in the classical paper by Forsyth (1985) but extended it by fully considering correlated initial loads, as suggested by McKenzie (2003). The significant complexity of this problem, even in a two-layer case, barred us from considering initial loads with anisotropic power spectral densities, wave vector-dependent initial-loading fractions and load-correlation coefficients, anisotropic flexural rigidities, or any other elaborations on the classical theory. However, we have suggested methods by which the presence of such additional complexity can be tested through residual inspection.
The principal steps in our algorithm are as follows. After collecting the Fourier-transformed observations (82) into a vector H•(k) we form the blurred Whittle likelihood of eq. (97) as the average over the K wavenumbers in the half plane, the Gaussian quadratic form wherebyS• is the blurred version, per eq. (84), of the spectral matrix formulated in eqs (76)-(78). The likelihood depends on the lithospheric parameters of interest, namely the flexural rigidity D, the initial-loading ratio f 2 , and the load-correlation coefficient r, and on the spectral parameters σ 2 , ν, ρ of the Matérn form (72) that captures the isotropic shape of the power spectral density of the initial loading. Maximization of eq. (150) then yields estimates of these six parameters. To appraise their covariance, we turn to the unblurred Whittle likelihood of eq. (100), its first derivatives (the score), its second derivatives (the Hessian), and their expectation (the Fisher matrix), whose inverse relates to the variance of the parameter estimates as With this knowledge we construct 100×α % confidence intervalŝ The problem of producing likely values of lithospheric strength, initial-loading fraction and load correlation for a geographic region of interest required positing an appropriate model for the relationship between gravity and topography. The gravity field had to be downward continued (to produce subsurface topography), and the statistical nature of the parameter recovery problem had to be acknowledged. There are many methods to produce estimators, and depending on what can be reasonably assumed, different estimators will result, all with different bias and variance characteristics. In general one wishes to obtain unbiased and asymptotically efficient estimators, i.e. estimators whose variance is competitive with any other method for increasing sample sizes. Our goal in this work has been to whittle down the assumptions, while keeping the model both simple and realistic.
If the parametric models that we have proposed are realistic then we are assured of good estimation properties. Maximum-likelihood estimators are both asymptotically unbiased and efficient (often with minimum variance, see, e.g., Portnoy 1977). Should we use another method, with more parameters, or even non-parametric nuisance terms, unless those extra components in the model are necessary, we will literally waste data points on estimating needless degrees of freedom, and accrue an increased variance. Modeling the initial spectrum nonparametrically is such an example, of wasting half of the data points on the estimation. Producing the coherence or admittance estimate as a starting point for a subsequent estimation of the lithospheric parameters of interest is also highly suboptimal, and for the same reason. If the parametric models that we have assumed are not realistic then we will be able to diagnose this problem from the residuals, and this will be a check on the methods we apply. Hence, if the parametric models stand up to tests of this kind, then because of the properties of maximum-likelihood estimators, asymptotically, no other estimator will be able to compete in terms of variance. In that case the confidence intervals that we have produced in this paper are the best that could be produced.

A C K N O W L E D G M E N T S
This work was supported by the U. S. National Science Foundation under grants EAR-0710860, EAR-1014606 and EAR-1150145, and by the National Aeronautics and Space Administration under grant NNX11AQ45G to F.J.S., by U. K. EPSRC Leadership Fellowship EP/I005250/1 to S.C.O. She thanks Princeton University and he thanks University College London for their hospitality over the course of many mutual visits. In particular also, F.J.S. thanks Theresa Autino and Debbie Fahey for facilitating his visit to London via Princeton University account 195-2243 in 2011, and S.C.O. thanks the Imperial College Trust for funding her sabbatical visit to Princeton in 2006, where and when this work was commenced. We acknowledge useful discussions with Don Forsyth, Lara Kalnins, Jon Kirby, Mark Wieczorek and Tony Watts, but especially with Dan McKenzie. Two anonymous reviewers and the Associate Editor, Saskia Goes, are thanked for their helpful suggestions, which improved the paper. All computer code needed to reproduce the results and the figures in this paper is made freely available on www.frederik.net.

The Hessian F and the Fisher matrix F
The Hessian or second derivative of the log-likelihood function (100), and its negative expectation or the Fisher information matrix, are given by the expressions (132) and (133), respectively. Both of these contain the terms (173)-(178) and (181)-(186) that we have just derived, which renders them eminently calculable analytically. In its raw form eq. (133) does not provide much insight, but in Section 4.6 we also introduced special formulations for elements of the Fisher matrix that involve at least one spectral variable, in which case the expressions (131), (134) and (135) for F θ S θ S , F θ L θ S and F θ S θ ′ S , respectively, are of a common form. We do not foresee needing the expressions for the Hessian: while optimization procedures might benefit from those, even in eq. (149) the Fisher matrix could be substituted (Cox & Hinkley 1974).
We are thus left with determining the entries of the Fisher matrix F θ L θ ′ L when only lithospheric variables are present. The diagonal terms F θ L θ L are obtained via eq. (130), which we repeat here specifically for this case as Only to obtain the cross terms involving different lithospheric parameters do we need the full expression (133). Even this case simplifies since, owing to eq. (72), ∂ θ L S11 = 0, thereby yielding the expression where we recall from eq. (111) that ∂ θ L A θ ′ L = ∂ θ L ∂ θ ′ L T −1 • . When θL = θ ′ L , as is seen from eqs (173)-(175), the first term ∂ θ L m θ ′ L = 0. When θL = θ ′ L , eqs (188)-(189) are exactly each others' equivalent, and either expression can be used. We will not really need the eigenvalues of the quadratic forms: their sums of squares (in eq. 188) or sums (in eq. 189) suffice to calculate the elements of the Fisher matrix. The specific eigenvalues are only required if we should abandon the normal approximations and develop an interest in calculating the distributions of eq. (117) exactly.
Beginning with the flexural rigidity, we obtain For the loading ratio, we obtain for the sum of squares of the eigenvalues Finally, for the load-correlation coefficient we conclude that For the cross terms that remain, we find, at last,

Properties of admittance and coherence estimates -and "Cramér-Rao lite" for the maximum-likelihood estimate
Let us consider how the uncertainty on the parametersθ estimated via the maximum-likelihood method propagates to estimates of the coherence and the admittance, γ 2 • and Q•, should we desire to construct those. Since Section 4.7 we have known that our estimateθ, which is based on the likelihood (97) and thus ultimately on the data H•(k), is centered on the truth θ0 as per θ = θ0 + Y, and θ = θ0.
We know the distributional properties of Y as having a mean of zero and a variance that is proportional to the inverse of the Fourier-domain sample size K. Taking the Bouguer-topography coherence as an example, we can again use the delta method to write for its estimate from which easily follows that var{γ 2 • (θ)} = ∇γ 2 • (θ0) T var{Y} ∇γ 2 • (θ0) , at identical wavenumbers k, and a statement similar in form to eq. (199) for the covariance of the coherence estimate between different wavenumbers k and k ′ . With these we know the relevant statistics of maximum-likelihood-based admittance and coherence estimates. The "traditional" methods use estimates of coherence and admittance to derive estimates of the parameters θ. Regardless of how the former are computed (via parameterized maximum-likelihood techniques as in this paper, or non-parametrically using multitaper or other spectral techniques), we know one important thing about their statistics. No alternative estimate for the parameters that is unbiased will beat the variance of our maximum-likelihood estimate.
Let us imagine defining another unbiased estimator which would be given by another function of the data, generically written t, where t = θ0, since the first term, the pseudocovariance or relation matrix vanishes in the half-plane for the complex-proper Gaussian Fourier coefficients (Miller 1969;Thomson 1977;Neeser & Massey 1993) of real-valued stationary variables. We may thus conclude that the covariance of the scores suffers mildly from wavenumber correlation, However, for very large observation windows or custom-designed tapering procedures, we may write cov (γS) θ S , (γS) θ ′ From eq. (128) we then also recover the entries of the Fisher matrix for this problem as exactly half the size of the multivariate equivalent that we obtained in eq. (135), as expected, which are to be used in the construction of confidence intervals for the parameters σ 2 , ρ and ν of the isotropic Matérn distribution as determined by this procedure. The expressions for m θ S were listed in Appendix 9.3. Refer again also to Table 2, which we have only now completed filling.

Testing correlation via the likelihood-ratio test
We seek to evaluate the null and alternative hypotheses H0 : r = 0 versus H1 : r = 0.
Our definition of the log-likelihood L(θ) in eq. (100) included the correlation coefficient r between initial-loading topographies as a parameter to be estimated from the data. In contrast, the log-likelihoodL(θ) = L([θ T 0] T ) of eq. (101) did not. The Hessian of L is F and that ofL isF, and from eq. (109) we know that F converges in probability to the negative Fisher matrix −F and, similarly,F converges to the constant −F. This gives us the elements to evaluate the different scenarios. Should we evaluate "uncorrelated data" using a "correlated model", we need a significance test for the addition of the correlation parameter. Since the hypotheses (217) refer to nested models,θ containing some of the same entries as θ, see eqs (74)-(75), otherwise put standard likelihood-ratio theory (Cox & Hinkley 1974) applies. Let the truth under H0 be given by the parameter vector and let us consider having found two maximum-likelihood estimates, θu = arg maxL(θ) =θ.
Note that L(θ) ≤L(θ) and and that the estimates of 'everything-but-the-correlation-coefficient' are different from the full estimates depending on whether the correlation coefficient is included as a parameter to be estimated or not. We now define the maximum-log-likelihood ratio statistic from the evaluated likelihoods whereby we have used that, evaluated at the truth under H0, the likelihood valuesL(θ0) = L(θ0), and defined the auxiliary quantities X1 = 2K L(θ) − L(θ0) , and X2 = 2K L (θu) −L(θ0) .
By Taylor expansion of the log-likelihoods around the truth, to second order and with the first-order derivatives vanishing, we then have where we have used the limiting behavior (109). For more generality, we consider maximum-likelihood problems with a partitioned parameter vector whereby θ× may contain any number of extra parameters, θ× = [r] being the case under consideration. Introducing notation as we go along, the Fisher matrix for such problems partitions into four blocks (see also Kennett et al. 1998) such that we can write,

Maximum-likelihood estimation of flexural rigidity 41
The submatrices F× and F• contain the negative expectations of the second derivatives of the likelihood L, with respect to at least one of the 'extra' parameters θ× ∈ θ×, suitably arranged with the mnemonic subscripts × and •. The corner matricesF andF contain the second derivatives of the likelihoodL in only the 'simpler' subset of parametersθ ∈θ. The inverse of the Fisher matrix is given by thereby defining the auxiliary matrices, and, via the Woodbury identity, their inverses, as This yields the variances of the vectors partitions. Recalling from eq. (140) that √ K(θ − θ0) ∼ N (0, F −1 (θ0)), we may use eqs (227)-(228) to express the marginal distribution of the partitionθ× under the null hypothesis, √ Kθ× ∼ N (0, F −1 •× (θ0)).
In this general framework we rewrite likelihood-ratio statistic (222) with the help of eqs (224)-(225) as In order to figure out the properties of the likelihood-ratio test we now need to understand the properties of the difference between the 'correlated' and 'uncorrelated' estimatesθ −θu of eqs (220)-(221). We may note directly from Cox & Hinkley (1974) that Inserting this relation into eq. (232) the limiting behavior of the likelihood-ratio test statistics becomes X ≈ − √ Kθ