The debiased Whittle likelihood

The Whittle likelihood is a widely used and computationally efﬁcient pseudolikelihood. How-ever, it is known to produce biased parameter estimates with ﬁnite sample sizes for large classes of models. We propose a method for debiasing Whittle estimates for second-order stationary stochastic processes. The debiased Whittle likelihood can be computed in the same O ( n log n ) operations as the standard Whittle approach. We demonstrate the superior performance of our method in simulation studies and in application to a large-scale oceanographic dataset, where in both cases the debiased approach reduces bias by up to two orders of magnitude, achieving estimates that are close to those of the exact maximum likelihood, at a fraction of the computational cost. We prove that the method yields estimates that are consistent at an optimal convergence rate of n − 1 / 2 for Gaussian processes and for certain classes of non-Gaussian or nonlinear processes. This is established under weaker assumptions than in the standard theory, and in particular the power spectral density is not required to be continuous in frequency. We describe how the method can be readily combined with standard methods of bias reduction, such as tapering and differencing, to further reduce bias in parameter estimates.


Introduction
This paper introduces an improved computationally efficient method of estimating time series model parameters for second-order stationary processes. The standard approach is to maximize the exact time-domain likelihood, which in general has computational efficiency of order n 2 Debiased Whittle likelihood 3

Definitions and notation
We shall assume that the stochastic process of interest is modelled in continuous time; however, the debiased Whittle likelihood can be readily applied to discrete-time models, as we will demonstrate later. We define {X t } to be the infinite sequence obtained from sampling a continuous-time real-valued process X (t; θ) with zero mean or a known nonzero mean such that it can be subtracted a priori, where θ is a length-p vector that specifies the process. That is, we let X t ≡ X (t ; θ), where t is a positive or negative integer (t = . . . , −2, −1, 0, 1, 2, . . .) and > 0 is the sampling interval. If the process is second-order stationary, we define the autocovariance sequence by s(τ ; θ) ≡ E{X t X t−τ } for τ = . . . , −2, −1, 0, 1, 2, . . . , where E{·} is the expectation operator. The power spectral density of {X t } forms a Fourier pair with the autocovariance sequence, and is almost everywhere given by As {X t } is a discrete sequence, its Fourier representation is only defined up to the Nyquist frequency ±π/ . Thus, there may be departures between f (ω; θ) and the continuous-time process spectral density, denoted byf (ω; θ), which for almost all ω ∈ R is given bỹ Heres(λ; θ) ≡ E{X (t)X (t − λ)} (λ ∈ R) is the continuous-time process autocovariance, which is related to s(τ ; θ) vias(τ ; θ) = s(τ ; θ) when τ is an integer. It follows that see Percival & Walden (1993). Thus, contributions tof (ω; θ) outside the range of frequencies ±π/ are said to be folded or wrapped into f (ω; θ). We have defined both f (ω; θ) andf (ω; θ), as both quantities are important in separating aliasing from other artefacts in spectral estimation. In addition to these theoretical quantities, we will also require certain quantities that are computed directly from a single length-n sample {X t } n t=1 . A widely used but inconsistent estimate of f (ω; θ) is the periodogram, denoted by I (ω), which is the squared absolute value of the discrete Fourier transform: Note that I (ω) and J (ω) are taken to be properties of the observed realization and are formally not regarded as functions of θ .

Maximum likelihood and the Whittle likelihood
Consider the discrete sample X = {X } n t=1 , which is organized as a length-n column vector. Under the assumption that X is drawn from X (t; θ), the expected n × n autocovariance matrix is 4 A. M. Sykulski et al. C(θ) ≡ E{XX T }, where T denotes transpose, and the components of C(θ) are given by C ij (θ) = s (i−j; θ). Exact maximum likelihood inference can be performed for Gaussian data by evaluating the loglikelihood function (Brockwell & Davis, 1991, p. 254), where −1 denotes matrix inverse and |C(θ)| is the determinant of C(θ). In (4) we have omitted additive and multiplicative constants not affected by θ . The optimal choice of θ for our chosen model to characterize the sampled time series X is then found by maximizing the loglikelihood function in (4) so thatθ where defines the parameter space of θ. Because the time-domain maximum likelihood is known to have optimal properties, any other estimator will be compared with the properties of this quantity.
A standard technique for avoiding expensive matrix inversions is to approximate (4) in the frequency domain, following the seminal work of Whittle (1953). In this approach C(θ) is approximated using a Fourier representation, and special properties of Toeplitz matrices are utilized. Given the observed sampled time series X , the Whittle likelihood, denoted by W (θ), is where is the set of discrete Fourier frequencies, We have presented the Whittle likelihood in a discretized form here, as its usual integral representation must be approximated for finite-length time series. The Whittle likelihood approximates the time-domain likelihood, i.e., (θ) ≈ W (θ), and this statement can be made precise (Dzhaparidze & Yaglom, 1983). Its computational efficiency is O(n log n) as the periodogram can be computed using the fast Fourier transform, thus explaining its popularity in practice. Exact maximum likelihood, on the other hand, would require O(n 2 ) computations for regularly sampled Gaussian processes, using for example the Trench algorithm (Trench, 1964) to compute (4), and often has higher complexity for non-Gaussian processes, as demonstrated by our non-Gaussian simulation example in the Supplementary Material. The Whittle likelihood (5) is calculated using the periodogram, I (ω). This spectral estimate, however, is known to be a biased measure of the continuous-time process's spectral density for finite samples, due to blurring and aliasing effects (Percival & Walden, 1993). Aliasing results from the discrete sampling of the continuous-time process to generate an infinite sequence, whereas blurring is associated with the truncation of this infinite sequence over a finite-time interval. The desirable properties of the Whittle likelihood rely on the asymptotic behaviour of the periodogram for large sample sizes. The bias of the periodogram for finite samples, however, will translate into biased parameter estimates from the Whittle likelihood, as has been widely reported (e.g., Dahlhaus, 1988). In the next section we propose a procedure for debiasing Whittle estimates.

The debiased Whittle likelihood
We introduce the pseudolikelihood function where a subscript D stands for debiased. Heref (ω; θ) in (5) has been replaced byf n (ω; θ) ≡ E{I (ω)}, which is the expected periodogram, and can be shown to be given by the convolution of the true modelled spectrum with the Fejér kernel F n, (ω). We call (7) the debiased Whittle likelihood, where the set is defined as in (6).
Replacing the true spectrumf (ω; θ) with the expected periodogramf n (ω; θ) in (7) is a straightforward concept; however, our key innovation lies in formulating its efficient computation without losing O(n log n) efficiency. If we directly use (8), then this convolution would usually need to be approximated numerically and could be computationally expensive. Instead we utilize the convolution theorem to express the frequency-domain convolution as a time-domain multiplication (Percival & Walden, 1993, p. 198), such that where Re{·} denotes the real part. Thereforef n (ω; θ) can be exactly computed at each Fourier frequency directly from s(τ ; θ), provided its functional form is known for τ = 0, . . . , n − 1, by using a fast Fourier transform in O(n log n) operations. Care must be taken to subtract the variance term, × s(0; θ), to avoid double counting contributions from τ = 0. Both aliasing and blurring effects are automatically and conveniently accounted for in (9) in one operation; aliasing is accounted for by sampling the theoretical autocovariance function at discrete times, while the effect of blurring is accounted for by the truncation of this sequence to finite length and the inclusion of the triangle function (1 − τ/n) in the expression. The result is thatf n (ω; θ) is a blurred and aliased version of the true spectrumf (ω; θ), which reflects the blurring and aliasing artefacts present in the periodogram. The debiased Whittle likelihood can also be used with discrete-time processes, as (9) can be computed from the theoretical autocovariance sequence of the discrete process in exactly the same way. If the analytical form of s(τ ; θ) is unknown or expensive to evaluate, then it can be approximated from the spectral density using fast Fourier transforms, thus maintaining O(n log n) computational efficiency.
Computing the standard Whittle likelihood of (5) with the aliased spectrum f (ω; θ) defined in (1), without accounting for spectral blurring, would in general be more complicated than using the expected periodogramf n (ω; θ). This is because the aliased spectrum f (ω; θ) seldom has an analytical form for continuous processes, and must instead be approximated either by explicitly wrapping in contributions fromf (ω; θ) from frequencies higher than the Nyquist frequency as in (2), or via an approximation to the Fourier transform in (1). This is in contrast to the debiased Whittle likelihood, where the effects of aliasing and blurring have been computed exactly in a single operation using (9). Thus, addressing aliasing and blurring together by means of the 6 A. M. Sykulski et al. debiased Whittle likelihood is simpler and computationally faster to implement than accounting for aliasing alone. This will become further apparent in the simulation studies of § 5.1.

Combining with differencing or tapering
A standard technique for reducing the effects of blurring on Whittle estimates is to apply the Whittle likelihood to the differenced process (Velasco & Robinson, 2000), as often differencing will decrease the dynamic range of the spectrum and hence decrease broadband blurring. The debiased Whittle likelihood can be readily implemented with differenced data. If we denote the differenced process by U t = X t+1 − X t , then the expected periodogram of (9) can be computed using the autocovariance of U t , which is found from the autocovariance of Another standard approach to ameliorating the effects of blurring is to premultiply the data sequence by a weighting function known as a data taper (Thomson, 1982). The taper is chosen to have spectral properties such that broadband blurring will be minimized, and the variance of the spectral estimate at each frequency is reduced, although the trade-off is that tapering increases narrowband blurring as the correlation between neighbouring frequencies increases.
The tapered Whittle likelihood (Dahlhaus, 1988) corresponds to replacing the direct spectral estimator formed from I (ω) in (3) with one using the taper h = {h t }: where h t is real-valued. Setting h t = 1/n 1/2 for t = 1, . . . n recovers the periodogram estimate of (5). To estimate parameters, we then maximize where a subscript T indicates that a taper has been used. Velasco & Robinson (2000) demonstrated that for certain discrete processes it is beneficial to use this estimator, rather than the standard Whittle likelihood, for parameter estimation, particularly when the spectrum exhibits a high dynamic range. Nevertheless, tapering will not remove all broadband blurring effects in the likelihood, because we are still comparing the tapered spectral estimate against the theoretical spectrum, and not against the expected tapered spectral estimate. Furthermore, there remain the issues of narrowband blurring and aliasing effects with continuous sampled processes. Our debiasing procedure can be naturally combined with tapering. We define the pseudolikelihood with I (ω; h) as defined in (10) such thatf n (ω; h, θ) ≡ E{I (ω; h)}. We call TD (θ) the debiased tapered Whittle likelihood andf n (ω; h, θ) the expected tapered spectral estimate. The function f n (ω; h, θ) can be computed exactly using a O(n log n) calculation similar to (9); that is, Accounting inf n (ω; h, θ) for the particular taper used accomplishes debiasing of the tapered Whittle likelihood, just as using the expected periodogram does for the standard Whittle likelihood. The time-domain kernel n−τ t=1 h t h t+τ (τ = 0, . . . , n − 1) can be pre-computed using fast Fourier transforms or using a known analytical form. Then, during optimization, a fast Fourier transform of this fixed kernel multiplied by the autocovariance sequence is taken at each iteration. Thus the debiased tapered Whittle likelihood is also an O(n log n) pseudolikelihood estimator.
Both the debiased Whittle and the debiased tapered Whittle likelihoods have their merits, but the trade-offs are different with nonparametric spectral density estimation than with parametric model estimation. Specifically, although tapering decreases the variance of nonparametric estimates at each frequency, it conversely may increase the variance of estimated parameters. This is because the taper reduces the degrees of freedom in the data, which increases correlations between local frequencies. On the other hand, the periodogram creates broadband correlations between frequencies, especially for processes with high dynamic range, which also contributes to variance in parameter estimates. We explore these trade-offs in greater detail in § 5.1.

The Matérn process
In this section we investigate the performance of the debiased Whittle likelihood in a Monte Carlo study using observations from a Matérn process (Matérn, 1960), as motivated by the simulation studies of Anitescu et al. (2012), who investigated the same process. The Matérn process is a three-parameter continuous Gaussian process defined by its continuous-time unaliased spectrum The parameter A 0 determines the magnitude of the variability, 1/c > 0 is the damping time-scale, and α > 1/2 controls the rate of spectral decay, or equivalently the smoothness or differentiability of the process. For large α the power spectrum exhibits a high dynamic range, and the periodogram will be a poor estimator of the spectral density due to blurring. Conversely, for small α there will be departures between the periodogram and the continuous-time spectral density because of aliasing. We therefore investigate the performance of estimators over a range of α values, and this is the reason why the Matérn process is a suitable process to study.
In Table 1 we display the average percentage bias, standard deviation and root mean square error, relative to the true parameter values, for six different pseudolikelihoods: the standard Whittle likelihood, (5), with the observed and the differenced processes; the tapered Whittle likelihood (11); and the debiased versions, (7) and (12). Our choice of data taper is the discrete prolate spheroidal sequence taper (Slepian & Pollak, 1961), with bandwidth parameter 4, for which performance was found to be broadly similar across different choices of bandwidth. We also include results for the exact maximum likelihood, (4). The results are averaged over estimates of the three parameters {A, α, c}, all assumed to be unknown, where the true α varies from 0.6 to 2.5 8 A. M. Sykulski et al. in increments of 0.1 and we fix A = 1 and c = 0.2. This is to explore performance over spectra that have aliasing artefacts as well as high dynamic range. For each value of α, we simulate 10 000 time series, each of length n = 1000. The optimization is performed in Matlab using fminsearch, under identical settings for all likelihoods. Initialized guesses for the slope and amplitude are found using a least squares fit in the range [π/4 , 3π/4 ], and the initial guess for the damping parameter c is set at a mid-range value of 100 times the Rayleigh frequency, i.e., c = 100π/n = π/10. All standard Whittle methods have performance significantly contaminated by bias. The debiased variants decrease this bias by an order of magnitude. The standard deviation is broadly similar across all Whittle methods, and tapering results in standard deviations that are approximately twice that of maximum likelihood, which is consistent with the loss of information from using a data taper. Of all the pseudolikelihood estimators considered, the debiased Whittle likelihood using the differenced process performs best, and yields results close to the exact maximum likelihood. Overall, of the three modifications to the standard Whittle likelihood, debiasing, tapering and differencing, the debiasing method proposed here is the procedure that yields the greatest overall improvement in parameter estimation.
In the Supplementary Material, we present a figure which separates out the bias and root mean square error improvements over different values of α, demonstrating that the debiased Whittle likelihood can effectively reduce bias from aliasing when α is low and bias from blurring when α is high. In the Supplementary Material we also include a comparison with a time-domain O(n log n) estimator from Anitescu et al. (2012), which is found to perform similarly to the debiased Whittle likelihood with differenced data in terms of bias and root mean square error, although the latter method requires only a fraction of the computational time.
In the next section we will prove that the debiased Whittle likelihood is a consistent estimator converging at the optimal n −1/2 rate, under assumptions which are satisfied by the Matérn process. Motivated by this, we perform an additional experiment over different lengths of time series n = 2 k , with k taking integer values from 7 to 13, so that n ranges from 128 to 8192. To isolate the convergence of the parameter estimate, we fix A = 1 and c = 0.2 as before but this time assume that these are known, and now only estimate the slope parameter, which we set to α = 2.
The average bias, standard deviation and root mean square error of each estimator are plotted in Fig. 1, together with average CPU times. We show results for the standard and debiased Whittle likelihoods, as well as the exact maximum likelihood. Motivated by Table 1, we include the debiased Whittle likelihood with differenced data. Finally, we also report results for the standard Whittle likelihood using an approximated aliased spectrum, which we find using (2) by truncating the summation limits to ±5 to keep the computation efficient. The reason for including an approximate aliased version of the standard Whittle likelihood is to show that bias corrections are made by the debiased Whittle likelihood with regard to both blurring and aliasing. Here we see that the standard Whittle likelihood using the unaliased spectrum of (13) performs poorly with increasing n due to bias; this is because for growing domain asymptotics, bias due to aliasing does not decrease as n increases.
The standard deviations of all estimates converge at a rate consistent with n −1/2 , as Theorem 1 in § 6.1 will establish for debiased methods. Overall, the debiased approaches provide a good balance between statistical and computational efficiency over all sample sizes, whereas, in contrast, exact maximum likelihood is computed only up to n = 2048 due to rapidly increasing computational costs. For clarity of presentation the standard Whittle likelihood with differenced data is not included but was found to perform worse than the debiased Whittle likelihood with differenced data, consistent with the results in Table 1.

Application to large-scale oceanographic data
In this subsection we examine the performance of our method when applied to a realworld large-scale dataset, by analysing data obtained from the Global Drifter Program   http://www.aoml.noaa.gov/phod/dac/index.php, which maintains a publicly downloadable database of position measurements taken by freely drifting satellite-tracked oceanic instruments known as drifters. In total more than 23 000 drifters have been deployed, with interpolated six-hourly data available since 1979 and one-hourly data since 2005 (Elipot et al., 2016); over 100 million data points are available in total. Such data are pivotal to the understanding of ocean circulation and its impact on the global climate system (Griffa et al., 2007); it is therefore essential to have computationally efficient methods for their analysis. In Fig. 2 we display 50-day position trajectories and corresponding velocity time series for three drifters from the one-hourly dataset, each from a different major ocean. These trajectories can be considered as complex-valued time series, with the real part corresponding to the east/west velocity component and the imaginary part corresponding to the north/south velocity component. We then plot the periodogram of the complex-valued series, which has different powers at positive and negative frequencies, distinguishing directions of rotation on the complex plane (Schreier & Scharf, 2010). The debiased Whittle likelihood for complex-valued proper processes is exactly the same as (7)-(9), see also Sykulski et al. (2016), where the autocovariance sequence of a complex-valued process Z t is s(τ ; θ) = E(Z t Z * t−τ ). For proper processes the complementary covariance is r(τ ; θ) = E(Z t Z t−τ ) = 0 at all lags (Schreier & Scharf, 2010) and can therefore be ignored in the likelihood, as s(τ ; θ) captures all the second-order structure in the zero-mean process. We model the velocity time series as a complex-valued Matérn process, with power spectral density given in (13), as motivated by Sykulski et al. (2016) and Lilly et al. (2017). To account for a type of circular oscillations in each time series known as inertial oscillations, which create an off-zero spike on one side of the spectrum, we fit the Matérn process semiparametrically to the opposite, non-inertial side of the spectrum, represented by the solid red lines in the bottom row of Fig. 2; we overlay the fit of the debiased Whittle likelihood on the periodograms in Fig. 2. For a full parametric model of surface velocity time series, see Sykulski et al. (2016). We have selected drifters without noticeable tidal effects; for detiding procedures see Pawlowicz et al. (2002).
We estimate the Matérn parameters for each time series using the debiased and regular Whittle likelihoods, as well as the exact maximum likelihood. The last of these methods can be performed over only positive or negative frequencies by first decomposing the time series into analytic and anti-analytic components using the discrete Hilbert transform (see Marple, 1999) and then fitting the corresponding signal to an adjusted Matérn autocovariance that accounts for the effects of the Hilbert transform. The details of this procedure are provided in the Supplementary Material.
The parameter estimates from the three likelihoods are displayed in Table 2 along with the corresponding CPU times. We also report the estimated parameter standard errors using the method described in § 6.2, with more details in the online code. We reparameterize the Matérn process to output three important oceanographic quantities: the damping time-scale, the decay rate of the spectral slope, and the diffusivity, which is the rate of particle dispersion, κ ≡ A 2 /4c 2α (Lilly et al., 2017, equation (43)). From Table 2 it can be seen that the debiased Whittle and maximum likelihoods yield similar values for the slope and damping time-scale; however, regular Whittle likelihood yields parameters that underestimate the slope by around 15% and overestimate the damping time-scale by a factor of two, which if used would incorrectly specify underdamped and rougher-than-expected trajectories. These biases are consistent with the significant biases discussed in § 5.1. Diffusivity estimates vary across all estimation procedures and have large standard errors; this variability is likely due to the fact that diffusivity is a measure of the spectrum at frequency zero, and hence estimation is performed over relatively few frequencies.
The maximum likelihood is two orders of magnitude slower to execute than the debiased Whittle likelihood. When this is scaled to fitting all time series in the Global Drifter Program database, 12 A. M. Sykulski et al. time-domain maximum likelihood becomes impractical. The debiased Whittle likelihood, on the other hand, retains the speed of the Whittle likelihood while returning estimates that are close to the maximum likelihood. The results of this subsection therefore serve as a proof of concept of how the debiased Whittle likelihood is a useful tool for efficient parameter estimation from large datasets.

Autoregressive processes
We now investigate the performance of the debiased Whittle likelihood in estimating parameters of a discrete-time autoregressive process, X t = p k=1 φ k X t−k + t , where t are independent and identically distributed as N (0, σ 2 ). Specifically, we generate time series from the ar(4) autoregressive process studied in Percival & Walden (1993), used throughout the book as a motivating example of a process that generates high spectral blurring in spectral density estimation. As the process is in discrete-time, there is no issue with aliasing, so this example assesses how well the debiased Whittle likelihood accounts for bias that is purely due to blurring.
In Table 3 we display the average parameter estimates and root mean square errors in estimating all five parameters of the ar(4) process {φ 1 , φ 2 , φ 3 , φ 4 , σ }. We compare four approaches: maximum likelihood, the standard Whittle likelihood, the debiased Whittle likelihood, and the standard Yule-Walker estimation procedure, which is used to initialize parameter estimates for the likelihood-based methods. We do not include results obtained using the differenced process, as it was not found to give improved parameter estimates for this particular example.
The Yule-Walker and standard Whittle estimates perform similarly and quite poorly for both sample sizes considered, which is consistent with the fact that the former uses the biased sample autocovariance to solve the Yule-Walker equations, while the latter uses the periodogram, which Debiased Whittle likelihood 13 is the Fourier pair of the biased sample autocovariance. The debiased Whittle likelihood accounts for this bias and yields average estimates that are close to the exact maximum likelihood and the true values; it eliminates around half the root mean square error when n = 256, and two thirds when n = 1024. The debiased Whittle likelihood is therefore an effective pseudolikelihood for discrete-time as well as continuous-time processes. In the Supplementary Material we report further simulation results, including a performance comparison for a non-Gaussian process, where again the debiased Whittle likelihood is found to provide a good trade-off between statistical and computational efficiency.
6. Properties of the debiased Whittle likelihood 6.1. Consistency and optimal convergence rates We now establish consistency and optimal convergence rates for debiased Whittle estimates with Gaussian and certain classes of non-Gaussian or nonlinear processes. To show that debiased Whittle estimates converge at the optimal rate, the main challenge is that although our pseudolikelihood accounts for the bias of the periodogram, correlation between different frequencies caused by the leakage associated with the Fejér kernel is still present. This is what prevents the debiased Whittle likelihood from being exactly equal to the time-domain maximum likelihood for Gaussian data. To establish optimal convergence rates, we bound the asymptotic behaviour of this correlation. The statement is provided in Theorem 1, with the proof given in the Supplementary Material. The proof is composed of several lemmas, which, for example, place useful bounds on the expected periodogram, the variance of linear combinations of the periodogram at different frequencies, and also the first and second derivatives of the debiased Whittle likelihood. Together these establish that the debiased Whittle likelihood is a consistent estimator with estimates that converge in probability at an optimal rate of n −1/2 , under relatively weak assumptions.
Theorem 1. Assume that {X t } is an infinite sequence obtained from sampling a zero-mean continuous-time real-valued process X (t; θ) which satisfies the following assumptions: (i) the parameter set ⊂ R p is compact with a nonnull interior, and the true length-p parameter vector θ lies in the interior of ; (ii) for all θ ∈ and ω ∈ [−π, π], the spectral density of the sequence {X t } is bounded below by f (ω; θ) f min > 0 and bounded above by f (ω; θ) f max ; (iii) θ 1 | = θ 2 implies f (· ; θ 1 ) | = f (· ; θ 2 ) on a set of positive Lebesgue measure; (iv) f (ω; θ) is continuous in θ and Riemann-integrable in ω; (v) the expected periodogramf n (ω; θ), as defined in (9), has two continuous derivatives in θ which are bounded above in magnitude uniformly for all n, and the first derivative in θ also has (n) frequencies in that are nonzero; (vi) {X t } is a fourth-order stationary process with finite fourth-order moments and absolutely summable fourth-order cumulants.
The fourth-order cumulant is formally defined in the Supplementary Material. All stationary Gaussian processes automatically satisfy assumption (vi) as the fourth-order cumulant is identically zero. In the Supplementary Material we consider a class of nonlinear processes and prove that it satisfies assumption (vi). Specifically, we study the process Y t = X 2 t where X t is a Gaussian process with bounded spectral density and absolutely summable autocovariance.

Standard error estimation
Here we present a novel method of obtaining standard error estimates for debiased Whittle estimates. This method was used to calculate standard errors for our application example in Table 2. In the Supplementary Material we show that the p × p covariance matrix of the estimated vectorθ satisfies where ∇ = (∂/∂θ 1 ∂/∂θ 2 . . . ∂/∂θ p ) T and ∇ D (θ) is known as the score. The p×p matrix H (θ), known as the Hessian, is defined entrywise by H ij (θ) = ∂ 2 D (θ)/(∂θ i ∂θ j ), and its expectation can be approximated either analytically or numerically atθ. The remaining term in (14), var{∇ D (θ)}, is the p × p covariance matrix of the score. The diagonal elements of this matrix, which are variances of individual components of the score, can be expressed using (7) as where w j are the elements of defined in (6) and .
Here we have made use of the fact that the ∂ log{f n (ω; θ)}/∂θ term is deterministic and therefore has no variance. As we have established asymptotic efficiency forθ, we can now use the invariance principle of maximum likelihood estimators (Casella & Berger, 2002, p. 320) to construct an estimator of the variance, var ∂ ∂θ i D (θ) = n j=1 n k=1â ij (θ)â ik (θ)ĉov{I (ω j ), I (ω k )}, and by the same reasoning we can approximateâ ij (θ) by a ij (θ). Then, to estimate the covariance of the periodogram, we computê where the asterisk denotes the complex conjugate and D n (ω) is the noncentred Dirichlet kernel defined by D n (ω) ≡ sin (nω/2) sin (ω/2) exp{−i ω(n + 1)/2}, so that we arrive at estimates of the diagonal elements of var{∇ D (θ)}. Estimates of cov[∂/∂θ i { D (θ)}, ∂/∂θ j { D (θ)}], which are the off-diagonal terms of var{∇ D (θ)}, can be found in the same way. Then, substituting all estimated entries of var{∇ D (θ)} into (14), along with the estimate of the Hessian, gives estimates of the variance of the estimators.

Discussion
Standard theory shows that standard Whittle estimates are consistent with optimal convergence rates if the spectrum and its first and second partial derivatives in θ are continuous in ω and bounded from above and below (Dzhaparidze & Yaglom, 1983), as well as being twice continuously differentiable in θ. In contrast, we do not require that the spectrum or its derivatives be continuous in ω, so Theorem 1 will hold for discontinuous spectra, as long as the other assumptions, such as Riemann integrability, are satisfied. As detailed in the Supplementary Material, this is possible because the expectation of the score is zero after debiasing, which would not be the case for the standard Whittle likelihood; so we only need to consider the variance of the score and the Hessian. To control these variances, we make repeated use of a bound on the variance of linear combinations of the periodogram, a result previously established in Theorem 3.1 of Giraitis & Koul (2013) under a different set of assumptions.
It can easily be shown that the assumptions in Theorem 1 are weaker than standard Whittle assumptions, despite conditions on the behaviour of the expected periodogramf n (ω; θ) in assumption (v). This is because if the spectral density f (ω; θ) and its first and second partial derivatives in θ are continuous in both ω and θ, then it can be shown, by applying the Leibniz integration rule to the first and second derivatives of (8) with respect to θ, that f (ω; θ) being twice continuously differentiable in θ implies thatf n (ω; θ) is twice continuously differentiable in θ. To show this, we make use of Proposition 3.1 in Stein & Shakarchi (2003), which states that the convolution of two integrable and periodic functions is itself continuous. This result can also be used to show that f n (ω; θ) is always continuous in ω, even if f (ω; θ) is not, as from (8) we see thatf n (ω; θ) is the convolution of f (ω; θ) and the Fejér kernel, two functions which are integrable and 2π -periodic in ω. Therefore, not only doesf n (ω; θ) remove bias from blurring and aliasing and is computationally efficient to evaluate, but it also has desirable theoretical properties leading to consistency and optimal convergence rates of debiased Whittle estimates under weaker assumptions.